Building OAI-PMH Harvesters With Net::OAI::Harvester
Net::OAI::Harvester is a Perl package for easily interacting with OAI-PMH repositories as a metadata harvester. The article provides examples of how to use Net::OAI::Harvester to write short programs that execute each of the 6 OAI-PMH verbs. Issues related to efficient XML parsing of OAI-PMH responses are discussed, as are specific techniques used by Net::OAI::Harvester.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is an increasingly popular protocol for sharing metadata about digital objects. Part of the reason for this popularity is that the OAI-PMH helps to solve a common problem in a simple and flexible way using familiar technologies. The OAI-PMH is essentially a set of request/response messages which may be sent over HTTP to retrieve metadata that is encoded in XML. So if I am interested in the digital assets of a particular repository I can construct a familiar URL and get back an XML document containing the metadata I am interested in.
From a programming perspective there are several issues that arise when writing an OAI-PMH harvesting program: HTTP requests need to be URL-encoded for safe transmission; error conditions can arise which must be handled gracefully; resumption tokens may be used to break up a response into chunks; responses are XML which must be parsed in order to extract the data points that are of interest; and responses can be arbitrarily large at the whim of a given repository. Of greatest concern here is that responses are XML documents that can be arbitrarily large. While XML tools are available in all of today's major programming languages, they are generic tools which must be adapted to the particular needs of OAI-PMH responses. Furthermore, some parsing techniques are more appropriate than others for parsing very large XML documents, and it may not be clear to the beginner which tool to use.
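To make the first of these issues concrete, here is a minimal sketch of building a URL-encoded OAI-PMH request by hand, using only core Perl. The helper names percent_encode() and build_request_url() are made up for this illustration and are not part of Net::OAI::Harvester; the base URL and identifier are the Library of Congress ones used in the examples in this article.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except the
# RFC 3986 "unreserved" characters.
sub percent_encode {
    my ($value) = @_;
    utf8::encode($value);    # work on bytes, not characters
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf( '%%%02X', ord($1) )/ge;
    return $value;
}

# Hypothetical helper: build a request URL from a base URL, a verb,
# and the verb's arguments (sorted so the result is predictable).
sub build_request_url {
    my ( $base_url, $verb, %args ) = @_;
    my @pairs = ("verb=$verb");
    for my $name ( sort keys %args ) {
        push @pairs, $name . '=' . percent_encode( $args{$name} );
    }
    return $base_url . '?' . join( '&', @pairs );
}

my $url = build_request_url(
    'http://memory.loc.gov/cgi-bin/oai2_0',
    'GetRecord',
    identifier     => 'oai:lcoa1.loc.gov:loc.gmd/g3764p.pm003171',
    metadataPrefix => 'oai_dc',
);
print "$url\n";
```

The colons and the slash in the identifier come out as %3A and %2F, which is exactly the kind of detail that is easy to forget when assembling request URLs by hand.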
This article does not aim to describe the OAI-PMH in full (since it has been done well elsewhere [1]), or to detail the ins and outs of efficient XML parsing. Rather, it will examine the use of Net::OAI::Harvester, which is a toolkit for quickly building simple and efficient OAI-PMH harvesters. Net::OAI::Harvester is a Perl module that abstracts away all the details of generating the HTTP request, handling error conditions, and parsing XML so that extracted data can be easily used. It is hoped that this article will serve as a cookbook for easily building OAI-PMH harvesters. The examples use real-life OAI-PMH repositories (mostly the American Memory repository at the Library of Congress).
Net::OAI::Harvester and Perl
Net::OAI::Harvester is an extension to the Perl programming language which can be found on the Comprehensive Perl Archive Network [2]. Perl is a widely used language that has come to live in many different environments: from systems administration, to relational database access, to World Wide Web applications, to genetic sequencing. The CPAN is a repository of free, reusable object-oriented components which extend Perl's core functionality to work in these (and many more) areas. Similarly Net::OAI::Harvester extends Perl so that you can easily write programs to interact with OAI-PMH repositories. Since Perl's database interface (DBI [3]) is able to talk to most of today's popular databases you can easily store the results of your harvesting in the database of your choice.
Assuming you have Perl installed you can install Net::OAI::Harvester with one command:
perl -MCPAN -e 'install Net::OAI::Harvester'
This command will retrieve the Net::OAI::Harvester module from CPAN, check to make sure that other dependencies are installed, run the test suite, and install the package. If you run into trouble please jump to the conclusion for more information about where to get help.
Please Identify Yourself
Consider this first example of asking an OAI-PMH repository to identify itself using the Identify verb, and finding the name of the repository. To send the request we construct this URL:
http://memory.loc.gov/cgi-bin/oai2_0?verb=Identify
which generates this XML as a response:
 1 <?xml version="1.0" encoding="UTF-8"?>
 2 <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 3   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 4   xsi:schemaLocation="http://www.w3.org/2001/XMLSchema-instance
 5   http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 6   <responseDate>2003-12-07T21:44:24Z</responseDate>
 7   <request verb="Identify">http://memory.loc.gov/cgi-bin/oai2_0</request>
 8   <Identify>
 9     <repositoryName>Library of Congress Open Archive Initiative Repository 1</repositoryName>
10     <baseURL>http://memory.loc.gov/cgi-bin/oai2_0</baseURL>
11     <protocolVersion>2.0</protocolVersion>
12     <adminEmail>dwoo@loc.gov</adminEmail>
13     <adminEmail>caar@loc.gov</adminEmail>
14     <earliestDatestamp>2002-06-01T00:00:00Z</earliestDatestamp>
15     <deletedRecord>no</deletedRecord>
16     <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
17     <description>
18       <oai-identifier
19         xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
20         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
21         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
22         http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
23         <scheme>oai</scheme>
24         <repositoryIdentifier>lcoa1.loc.gov</repositoryIdentifier>
25         <delimiter>:</delimiter>
26         <sampleIdentifier>oai:lcoa1.loc.gov:loc.music/musdi.002 </sampleIdentifier>
27       </oai-identifier>
28     </description>
29     <description>
30       <eprints xmlns="http://www.openarchives.org/OAI/1.1/eprints"
31         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
32         xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/eprints
33         http://www.openarchives.org/OAI/1.1/eprints.xsd">
34         <content>
35           <URL>http://memory.loc.gov/ammem/oamh/lcoa1_content.html</URL>
36           <text>Selected collections of digitized historical materials from the Library of Congress, including many from American Memory. Includes photographs, movies, maps, pamphlets and printed
37           ephemera, sheet music and books.</text>
38         </content>
39         <metadataPolicy/>
40         <dataPolicy/>
41       </eprints>
42     </description>
43     <description>
44       <branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
45         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
46         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
47         http://www.openarchives.org/OAI/2.0/branding.xsd">
48         <collectionIcon>
49           <url>http://memory.loc.gov/ammem/oamh/lc-icon.gif </url>
50           <link>http://www.loc.gov</link>
51           <title>Library of Congress</title>
52           <width>100</width>
53           <height>35</height>
54         </collectionIcon>
55         <metadataRendering
56           metadataNamespace="http://www.loc.gov/MARC21/slim"
57           mimeType="text/xsl"> http://www.loc.gov/standards/marcxml/xslt/MARC21slim2HTML.xsl </metadataRendering>
58       </branding>
59     </description>
60
61   </Identify>
62 </OAI-PMH>
[EXAMPLE 1]
You can find the repository name hiding on line 9. Now imagine that you want to write a program to perform the same action, and to extract the repository name from the XML response. With Net::OAI::Harvester this can be done in just a few lines of code:
1 use Net::OAI::Harvester;
2
3 my $harvester = Net::OAI::Harvester->new(
4     baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
5 );
6
7 my $identity = $harvester->identify();
8 print $identity->repositoryName(),"\n";
OUTPUT:
Library of Congress Open Archive Initiative Repository 1
[EXAMPLE 2]
On line 1 the Net::OAI::Harvester module is loaded with use, which tells Perl to load the extension; lines 3-5 create a harvester object for the Library of Congress OAI-PMH repository; line 7 calls the Identify verb on the repository, which returns a Net::OAI::Identify object; line 8 prints out the repository name using the repositoryName() method. The program is able to call the Identify verb on the OAI-PMH repository, collect the response, and extract the repository name from the XML without having to make the HTTP request explicitly or to do any XML parsing. When the identify() method is called on line 7, Net::OAI::Harvester generates the HTTP request, stores the response, parses the XML, and bundles up the extracted information in a Net::OAI::Identify object which is then returned.
The call to identify() on line 7 is also important because it illustrates how all 6 OAI-PMH verbs are implemented in Net::OAI::Harvester. All the verbs correspond to method names that can be called on a Net::OAI::Harvester object. When called, each method returns a corresponding object which has the requested information bundled inside it. So just as identify() returns a Net::OAI::Identify object, listMetadataFormats() returns a Net::OAI::ListMetadataFormats object, getRecord() returns a Net::OAI::Record object, listRecords() returns a Net::OAI::ListRecords object, listIdentifiers() returns a Net::OAI::ListIdentifiers object, and listSets() returns a Net::OAI::ListSets object. To obtain the documentation for any of these modules you can issue the perldoc command on the package in question:
perldoc Net::OAI::Harvester
More Verbage
As explained above, the OAI-PMH supports sending any flavour of metadata as long as it can be expressed in XML. So a repository can make its metadata available as MARC21 using the Library of Congress schema [4], or as an EAD document [5]; however, all repositories must provide baseline Dublin Core support. The ListMetadataFormats OAI-PMH verb translates into the listMetadataFormats() method, which you can call on your Net::OAI::Harvester object.
1 use Net::OAI::Harvester;
2
3 my $harvester = Net::OAI::Harvester->new(
4     baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
5 );
6
7 my $metadataFormats = $harvester->listMetadataFormats();
8 print join( ',', $metadataFormats->prefixes() ), "\n";
OUTPUT:
oai_dc,oai_marc,marc21,mods
[EXAMPLE 3]
As you can see, much of this looks the same as example 2: lines 1-5 create a Net::OAI::Harvester object for the Library of Congress repository. The main difference is that the listMetadataFormats() method is called on line 7, and line 8 prints out the metadata prefixes for each of the formats. These formats are important when it comes time to start retrieving metadata from a repository since they are used to specify what flavour of metadata you would like. The OAI-PMH allows you to query the repository to find out which metadata formats are supported for a specific record using the identifier parameter, which translates exactly into a listMetadataFormats() parameter:
1 my $metadataFormats = $harvester->listMetadataFormats(
2     identifier => 'oai:lcoa1.loc.gov:loc.gmd/g3764p.pm003171'
3 );
[EXAMPLE 4]
A single OAI-PMH repository can be divided into groups or sets, which can be retrieved using the ListSets verb. The listSets() method performs this action and returns a Net::OAI::ListSets object that contains the set information for the repository in question.
 1 use Net::OAI::Harvester;
 2
 3 my $harvester = Net::OAI::Harvester->new(
 4     baseURL => 'http://arXiv.org/oai2'
 5 );
 6
 7 my $sets = $harvester->listSets();
 8 foreach my $spec ( $sets->setSpecs() ) {
 9     print "$spec ==> ", $sets->setName($spec), "\n";
10 }
OUTPUT:
cs ==> Computer Science
math ==> Mathematics
nlin ==> Nonlinear Sciences
physics ==> Physics
physics:acc-phys ==> Accelerator Physics
physics:ao-sci ==> Atmospheric-Oceanic Sciences
physics:astro-ph ==> Astrophysics
physics:atom-ph ==> Atomic, Molecular and Optical Physics
physics:bayes-an ==> Bayesian Analysis
physics:chem-ph ==> Chemical Physics
physics:cond-mat ==> Condensed Matter
physics:gr-qc ==> General Relativity and Quantum Cosmology
physics:hep-ex ==> High Energy Physics - Experiment
physics:hep-lat ==> High Energy Physics - Lattice
physics:hep-ph ==> High Energy Physics - Phenomenology
physics:hep-th ==> High Energy Physics - Theory
physics:math-ph ==> Mathematical Physics
physics:mtrl-th ==> Materials Theory
physics:nucl-ex ==> Nuclear Experiment
physics:nucl-th ==> Nuclear Theory
physics:phys-lib ==> Physics "Library"
physics:physics ==> Physics (Other)
physics:plasm-ph ==> Plasma Physics
physics:quant-ph ==> Quantum Physics
physics:supr-con ==> Superconductivity
q-bio ==> Quantitative Biology
[EXAMPLE 5]
Lines 1-5 create a Net::OAI::Harvester object to target the arXiv OAI-PMH preprint archive. Line 7 calls the listSets() method on the harvester object, which returns a Net::OAI::ListSets object. On lines 8-10 we then iterate through the setSpecs (unique identifiers for each set), and print each out along with the full name for the set.
Headers and Records
All of this has been leading up to the main focus of the OAI-PMH: obtaining metadata. The OAI-PMH has three verbs which facilitate obtaining metadata from a repository: ListIdentifiers, ListRecords and GetRecord. Each of these verbs translates into a Net::OAI::Harvester method: listIdentifiers(), listRecords() and getRecord(). The OAI-PMH defines an identifier as unambiguously identifying an item within a repository. Since metadata records can come in multiple formats, the identifier allows you to target the item you would like. The idea of the ListIdentifiers verb is that it allows a harvester to see what identifiers exist in the repository and to only request those that are of interest.
 1 use Net::OAI::Harvester;
 2
 3 my $harvester = Net::OAI::Harvester->new(
 4     baseURL => 'http://arXiv.org/oai2'
 5 );
 6
 7 my $list = $harvester->listIdentifiers(
 8     metadataPrefix => 'oai_dc',
 9     set => 'cs'
10 );
11
12 while ( my $header = $list->next() ) {
13     print $header->identifier(),"\n";
14 }
OUTPUT:
oai:arXiv.org:cmp-lg/9404001
oai:arXiv.org:cmp-lg/9404002
oai:arXiv.org:cmp-lg/9404003
oai:arXiv.org:cmp-lg/9404004
oai:arXiv.org:cmp-lg/9404005
oai:arXiv.org:cmp-lg/9404006
oai:arXiv.org:cmp-lg/9404007
oai:arXiv.org:cmp-lg/9404008
oai:arXiv.org:cmp-lg/9404011
oai:arXiv.org:cmp-lg/9405001
...
[EXAMPLE 6]
The output above is actually truncated to 10 rows from 3956, which illustrates the need for using an iterator in lines 12-14. The call to listIdentifiers() on lines 7-10 includes the required metadataPrefix parameter, which tells the repository that we are interested in records that are available as baseline Dublin Core, and the optional set parameter, which indicates we are only interested in identifiers from the Computer Science set (see example 5).
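Once a list of identifiers is in hand, a harvester can decide which items are worth requesting. The following sketch uses only core Perl and works on a few identifiers copied from the listing in example 6, so it needs no network access. It assumes the identifiers follow the oai-identifier layout shown in example 1 (scheme, repository identifier and local part separated by colons); parse_identifier() is a name invented for this illustration, not a Net::OAI::Harvester method.

```perl
use strict;
use warnings;

# A few identifiers copied from the truncated listing in example 6.
my @identifiers = qw(
    oai:arXiv.org:cmp-lg/9404001
    oai:arXiv.org:cmp-lg/9404002
    oai:arXiv.org:cmp-lg/9404011
    oai:arXiv.org:cmp-lg/9405001
);

# Hypothetical helper: split "scheme:repository:local" into its parts.
# The limit of 3 keeps any further colons inside the local part.
sub parse_identifier {
    my ($identifier) = @_;
    my ( $scheme, $repository, $local ) = split /:/, $identifier, 3;
    return { scheme => $scheme, repository => $repository, local => $local };
}

# Keep only the items whose local part starts with cmp-lg/9404.
my @wanted =
    grep { parse_identifier($_)->{local} =~ m{^cmp-lg/9404} } @identifiers;

print "$_\n" for @wanted;
```

Each identifier that survives the filter could then be passed to getRecord().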
Assuming that our harvester is interested in retrieving identifier oai:arXiv.org:cmp-lg/9404001 as a Dublin Core record, it can issue a getRecord() request.
 1 use Net::OAI::Harvester;
 2
 3 my $harvester = Net::OAI::Harvester->new(
 4     baseURL => 'http://arXiv.org/oai2'
 5 );
 6
 7 my $record = $harvester->getRecord(
 8     identifier => 'oai:arXiv.org:cmp-lg/9404001',
 9     metadataPrefix => 'oai_dc',
10 );
11
12 my $metadata = $record->metadata();
13 print $metadata->title();
OUTPUT:
An Alternative Conception of Tree-Adjoining Derivation
[EXAMPLE 7]
The call to getRecord() on lines 7-10 requests a Dublin Core record for a specific item from the arXiv.org repository, which is returned as a Net::OAI::Record object. Since an OAI-PMH record is actually made up of the record header and metadata, there are two corresponding methods you can call on your record object: header() and metadata(). On line 12 the metadata() method is used to get at the metadata; it returns a Net::OAI::Record::OAI_DC object in which all the metadata is stored. Finally, on line 13 we print out the title. The description of what is going on in examples 6 and 7 is perhaps more complicated than the code itself, which illustrates how you can interact with an OAI-PMH server in just a few lines.
The final OAI-PMH verb to examine is ListRecords, which acts somewhat like ListIdentifiers but returns a list of records (Net::OAI::Record objects) rather than identifiers. For example, a harvester could extract all the titles and URLs from the American Memory LC Photographs collection like this:
 1 use Net::OAI::Harvester;
 2
 3 my $harvester = Net::OAI::Harvester->new(
 4     baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
 5 );
 6
 7 my $list = $harvester->listRecords(
 8     metadataPrefix => 'oai_dc',
 9     set => 'lcphotos'
10 );
11
12 while ( my $record = $list->next() ) {
13     my $metadata = $record->metadata();
14     print "title: ",$metadata->title(),"\n";
15     print "url: ",$metadata->identifier(),"\n\n";
16 }
OUTPUT:
title: Washington Street, Sonora, Tuolumne County, California
url: http://hdl.loc.gov/loc.pnp/cph.3a00517

title: Stanislaus Flour Mill and bridge at Knight's Ferry, Stanislaus County, California
url: http://hdl.loc.gov/loc.pnp/cph.3a00518

title: Stanislaus Flour Mill from the bridge over the river at Knight's Ferry, Stanislaus County
url: http://hdl.loc.gov/loc.pnp/cph.3a00519

title: Sonora - Tuolomne County - the Court House
url: http://hdl.loc.gov/loc.pnp/cph.3a00520

title: The ram of MONITOR CAMANCHE, San Francisco
url: http://hdl.loc.gov/loc.pnp/cph.3a01294
...
[EXAMPLE 8]
Again, output here has been truncated, since the result of listRecords() can be arbitrarily large. Lines 12-16 show how to iterate over the records obtained by calling listRecords(), and how to extract the two fields of interest from the metadata.
Resumption Tokens
One issue that has been glossed over somewhat (until now) is that when an OAI-PMH server receives a ListRecords, ListIdentifiers or ListSets request, it may decide to deliver only a portion of the results, along with a resumption token, which may be used to issue a new request to get the remaining results. The OAI-PMH designers added this functionality to allow repository servers to avoid the cost of having to serve up huge documents in one fell swoop. From a harvesting perspective it is therefore necessary to check for a resumption token when performing a List action. The result of calling listRecords() is a Net::OAI::ListRecords object which has a resumptionToken() method. This method returns a Net::OAI::ResumptionToken object, from which it is then possible to extract the token and submit another listRecords() call to get the remaining records.
 1 use Net::OAI::Harvester;
 2
 3 my $harvester = Net::OAI::Harvester->new(
 4     baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
 5 );
 6
 7 my $records = $harvester->listRecords(
 8     metadataPrefix => 'oai_dc',
 9     set => 'lcphotos'
10 );
11 my $finished = 0;
12
13 while ( ! $finished ) {
14
15     while ( my $record = $records->next() ) {
16         print $record->metadata()->title(),"\n";
17     }
18
19     my $rToken = $records->resumptionToken();
20     if ( $rToken ) {
21         $records = $harvester->listRecords(
22             resumptionToken => $rToken->token()
23         );
24     } else {
25         $finished = 1;
26     }
27
28 }
[EXAMPLE 9]
However, in the spirit of making easy things easy and hard things possible, Net::OAI::Harvester has the methods listAllRecords() and listAllIdentifiers(), which will automatically check for and handle resumption tokens. This means the above code can be rewritten as:
 1 use Net::OAI::Harvester;
 2
 3 my $harvester = Net::OAI::Harvester->new(
 4     baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
 5 );
 6
 7 my $records = $harvester->listAllRecords(
 8     metadataPrefix => 'oai_dc',
 9     set => 'lcphotos'
10 );
11
12 while ( my $record = $records->next() ) {
13     print $record->metadata()->title(),"\n";
14 }
[EXAMPLE 10]
The use of listAllRecords() on line 7 means that resumption tokens can be handled in half the amount of code, and the program is much more readable as a result.
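To make the control flow behind resumption tokens easier to see in isolation, here is a small simulation that runs without a network connection. It is only a sketch: the %pages hash stands in for a repository that splits its results into three partial responses, and fetch_page() and harvest_all() are names made up for this illustration. It does not show how Net::OAI::Harvester is implemented internally.

```perl
use strict;
use warnings;

# Stand-in for a repository: three partial responses chained by tokens.
# The empty key represents the initial request, which carries no token.
my %pages = (
    ''       => { titles => [ 'Title A', 'Title B' ], token => 'page-2' },
    'page-2' => { titles => [ 'Title C', 'Title D' ], token => 'page-3' },
    'page-3' => { titles => ['Title E'],              token => undef },
);

# Hypothetical fetch: returns one partial response for a given token.
sub fetch_page {
    my ($token) = @_;
    return $pages{ defined $token ? $token : '' };
}

# Keep requesting until a response arrives without a resumption token.
sub harvest_all {
    my @titles;
    my $token;
    do {
        my $page = fetch_page($token);
        push @titles, @{ $page->{titles} };
        $token = $page->{token};
    } while ( defined $token );
    return @titles;
}

my @all = harvest_all();
print scalar(@all), " titles: @all\n";
```

The loop has the same shape as lines 13-28 of example 9: consume the current batch, look for a token, and stop when there is none.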
A Quick Peek Inside
Since OAI-PMH responses to the ListRecords and ListIdentifiers verbs can be arbitrarily large, Net::OAI::Harvester uses some special techniques to make sure that XML parsing is done in a memory-efficient way. There are two common approaches to parsing XML: building an in-memory data structure of the entire XML document (DOM), and stream-based parsing, where the XML is analysed as it is read in (SAX). Net::OAI::Harvester uses a stream-based parser (XML::SAX [6]), which allows large documents to be read in without resorting to building a huge Document Object Model in memory.
As the large list responses are parsed, Net::OAI::Harvester builds Net::OAI::Record objects and freezes them on disk using the Storable [7] module. Then when the parsing is complete you get each of these objects back from disk with each call to next().
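The freeze-to-disk idea can be sketched with core modules alone. The example below is an assumption about the general technique, not a copy of the module's internals: plain hashes with made-up identifiers stand in for Net::OAI::Record objects, and next_record() is a hypothetical helper playing the role of next(). It uses Storable's store_fd() and fd_retrieve() together with a temporary file from File::Temp.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Storable qw( store_fd fd_retrieve );

# Temporary spool file; it is removed automatically when the program exits.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
binmode $fh;

# Stand-ins for records produced one at a time by a stream parser.
my @parsed = (
    { identifier => 'oai:example:1', title => 'First record' },
    { identifier => 'oai:example:2', title => 'Second record' },
    { identifier => 'oai:example:3', title => 'Third record' },
);

# Freeze each record to disk as soon as it is complete.
store_fd( $_, $fh ) for @parsed;

# Rewind, then hand the records back one at a time.
seek( $fh, 0, 0 ) or die "cannot rewind $filename: $!";

# Hypothetical iterator: thaw the next record, or return undef at the end.
sub next_record {
    return undef if eof($fh);
    return fd_retrieve($fh);
}

my @titles;
while ( my $record = next_record() ) {
    push @titles, $record->{title};
}
print join( "\n", @titles ), "\n";
```

Only one record needs to be held in memory at a time while reading, which is the point of spooling to disk in the first place.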
Custom Metadata Handlers
The true flexibility of the OAI-PMH lies in its ability to allow harvesters to retrieve metadata in any XML-based format. So if a site wants to offer metadata records encoded using the MARC21 schema or the Encoded Archival Description it can do so; but the repository must offer Dublin Core records as a lowest common denominator format. Net::OAI::Harvester includes the module Net::OAI::Record::OAI_DC, which is an XML SAX handler for storing baseline Dublin Core data that has been encoded using the OAI-PMH schema [8]. It gets called automatically when you are retrieving metadata.
In order to handle other types of metadata formats you can plug your own custom metadata handler into Net::OAI::Harvester. The handlers work like SAX filters, which are best described elsewhere [9]. Jumping ahead a little bit, if the custom handler has been created with the name MODS, it can be used with the listRecords() call:
 1 use Net::OAI::Harvester;
 2 use MODS;
 3
 4 my $harvester = Net::OAI::Harvester->new(
 5     baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
 6 );
 7
 8 my $list = $harvester->listRecords(
 9     metadataPrefix => 'mods',
10     set => 'lcphotos',
11     metadataHandler => 'MODS',
12 );
13
14 while ( my $record = $list->next() ) {
15     print $record->metadata()->title(),"\n";
16 }
On line 2, the custom metadata handler MODS is used (more about that below). Then on line 11 the metadataHandler parameter is used to tell the harvester object to employ that package name for creating the metadata objects. The rest of the listRecords() usage stays the same. Below is an example of what a custom metadata handler would look like.
package MODS;

use base qw( XML::SAX::Base );

sub new {
    my $class = shift;
    my $self = { insideTitleInfo => 0, title => '' };
    ## bless into the invoking class so the object can find the methods below
    return( bless $self, $class );
}

sub title {
    my $self = shift;
    return( $self->{ title } );
}

## SAX Methods

sub start_element {
    my ( $self, $element ) = @_;
    if ( $element->{ Name } eq 'titleInfo' ) {
        $self->{ insideTitleInfo } = 1;
    }
}

sub end_element {
    my ( $self, $element ) = @_;
    if ( $element->{ Name } eq 'titleInfo' ) {
        $self->{ insideTitleInfo } = 0;
    }
}

sub characters {
    my ( $self, $chars ) = @_;
    if ( $self->{ insideTitleInfo } ) {
        $self->{ title } .= $chars->{ Data };
    }
}

## a module loaded with "use" must end with a true value
1;
This is not for the faint-hearted, but the idea is that it is possible to extend Net::OAI::Harvester to work with any XML-based metadata format. The intention is that as people develop plugins they can be added back into the Net::OAI::Harvester distribution so that everyone can benefit from the work.
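One way to get comfortable with a handler like this is to drive its callbacks by hand before plugging it into a real parser. The sketch below does that with core Perl only. TitleCollector is a made-up package that mirrors the three callbacks of the MODS handler above but leaves out the XML::SAX::Base parent, and the event list is a hand-written stand-in for what a SAX parser would emit for a small MODS fragment.

```perl
use strict;
use warnings;

# Made-up stand-in for the MODS handler, without the XML::SAX::Base parent.
package TitleCollector;

sub new {
    my $class = shift;
    return bless { insideTitleInfo => 0, title => '' }, $class;
}

sub title { return $_[0]->{title} }

# Remember when we are inside a titleInfo element.
sub start_element {
    my ( $self, $element ) = @_;
    $self->{insideTitleInfo} = 1 if $element->{Name} eq 'titleInfo';
}

sub end_element {
    my ( $self, $element ) = @_;
    $self->{insideTitleInfo} = 0 if $element->{Name} eq 'titleInfo';
}

# Text may arrive in several chunks, so append rather than assign.
sub characters {
    my ( $self, $chars ) = @_;
    $self->{title} .= $chars->{Data} if $self->{insideTitleInfo};
}

package main;

# Hand-written events, in the order a parser would report them.
my @events = (
    [ start_element => { Name => 'mods' } ],
    [ start_element => { Name => 'titleInfo' } ],
    [ start_element => { Name => 'title' } ],
    [ characters    => { Data => 'Washington Street, ' } ],
    [ characters    => { Data => 'Sonora' } ],
    [ end_element   => { Name => 'title' } ],
    [ end_element   => { Name => 'titleInfo' } ],
    [ start_element => { Name => 'note' } ],
    [ characters    => { Data => 'ignored text' } ],
    [ end_element   => { Name => 'note' } ],
    [ end_element   => { Name => 'mods' } ],
);

my $handler = TitleCollector->new();
for my $event (@events) {
    my ( $method, $data ) = @$event;
    $handler->$method($data);
}
print $handler->title(), "\n";
```

The text inside the note element is dropped because the insideTitleInfo flag is off by the time it arrives, which is the whole trick behind this style of stream handler.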
Conclusion
I hope this article has whetted your appetite for building OAI-PMH harvesters using the Net::OAI::Harvester toolkit. Knowing a bit about Perl beforehand is helpful, but not mandatory for getting basic usage out of Net::OAI::Harvester. In addition the perl4lib listserv [10] is a good place to learn more about Perl and using the Net::OAI::Harvester module.
References
[1] Special Issue on Open Archives Initiative Metadata Harvesting, Library Hi Tech, Volume 21, Number 2, 2003.
[2] Comprehensive Perl Archive Network (CPAN). http://www.cpan.org/
[3] Database Interface (DBI). http://search.cpan.org/perldoc?DBI
[4] Metadata Object Description Schema (MODS). http://www.loc.gov/standards/mods/
[5] Encoded Archival Description (EAD). http://www.loc.gov/ead/
[6] XML::SAX. http://search.cpan.org/perldoc?XML::SAX
[7] Storable. http://search.cpan.org/perldoc?Storable
[8] OAI-PMH Dublin Core Schema. http://www.openarchives.org/OAI/openarchivesprotocol.html#dublincore
[9] Transforming XML with SAX Filters, Kip Hampton, XML.com, 10 October 2001. http://www.xml.com/pub/a/2001/10/10/sax-filters.html
[10] perl4lib. http://perl4lib.perl.org/