An Introduction to the Search/Retrieve URL Service (SRU)
This article is an introduction to the "brother and sister" Web Service protocols named Search/Retrieve Web Service (SRW) and Search/Retrieve URL Service (SRU), with an emphasis on the latter. More specifically, the article outlines the problems SRW/U are intended to solve, the similarities and differences between SRW and SRU, the complementary nature of the protocols with OAI-PMH, and how SRU is being employed in an NSF (National Science Foundation)-sponsored grant called OCKHAM to facilitate an alerting service. The article is seasoned with a bit of XML and Perl code to illustrate the points. The canonical home page describing SRW/U [1] is also a useful starting point.
The Problems SRW and SRU are Intended to Solve
SRW and SRU are intended to define a standard form for Internet search queries as well as the structure of the responses. The shape of existing queries illustrates the problem. The following URLs are searches for 'dogs and cats' against three popular Internet search engines:
- http://www.google.com/search?hl=en&ie=ISO-8859-1&q=dogs+and+cats&btnG=Google+Search
- http://search.yahoo.com/search?fr=fp-pull-web-t&p=dogs+and+cats
- http://search.msn.com/results.aspx?FORM=MSNH&q=dogs%20and%20cats
Even though the queries are the same, the syntax implementing them differs from engine to engine. Worse still is the structure of the responses: each one contains not only search results but a great deal of formatting as well. SRW and SRU address these shortcomings by specifying the syntax for both queries and results. Such specifications open up Internet-accessible search functions and allow for the creation of tools to explore the content of the 'hidden Web' more effectively. SRW/U allow people and HTTP user agents to query Internet databases more seamlessly without the need for more expensive and complicated meta-search protocols.
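By way of contrast, the same query posed to any SRU-compliant server would take a single, predictable shape, along these lines (example.org is a placeholder, not a real service):

- http://example.org/?operation=searchRetrieve&query=dogs+and+cats&version=1.1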
SRW and SRU as Web Services
SRW/U are Web Services-based protocols for querying Internet indexes or databases and returning search results. Web Services essentially come in two flavours: REST (Representational State Transfer) and SOAP (Simple Object Access Protocol). A "REST-ful" Web Service usually encodes commands from a client to a server in the query string of a URL. Each name/value pair of the query string specifies a set of input parameters for the server. Once received, the server parses these name/value pairs, does some processing using them as input, and returns the results as an XML stream. The shape of the query string as well as the shape of the XML stream are dictated by the protocol. By definition, the communication process between the client and the server is facilitated over an HTTP connection.
OAI-PMH is an excellent example of a REST-ful Web Service. While OAI 'verbs' can be encoded as POST requests in an HTTP header, they are usually implemented as GET requests. These verbs act as commands for the OAI data repository, which responds with XML conforming to a specific schema. "SOAP-ful" Web Services work in a similar manner, except that the name/value pairs of SOAP requests are encoded in an XML SOAP 'envelope'. Similarly, SOAP servers return responses using the SOAP XML vocabulary. The biggest difference between REST-ful and SOAP-ful Web Services is the transport mechanism. REST-ful Web Services are always transmitted via HTTP. SOAP requests and responses can be carried over many other transport mechanisms, including email, SSH (Secure Shell), and telnet, as well as HTTP.
Web Services essentially send requests for information from a client to a server. The server reads the input, processes it, and returns the results as an XML stream back to the client. REST-ful Web Services usually encode the input in the shape of URLs. SOAP requests are marked up in a SOAP XML vocabulary. REST-ful Web Services return XML streams of varying shapes. SOAP Web Services return SOAP XML streams. Many people find REST-ful Web Services easier to work with because they are simpler to implement and carry less overhead. SOAP is more robust and can be used in a larger number of networked environments.
SRW is a SOAP-ful Web Service. SRU is a REST-ful Web service. Despite the differences in implementation, they are really very similar since they both define a similar set of commands (known as "operations") and responses. Where OAI-PMH defines six 'verbs', SRW/U support three 'operations': explain, scan, and searchRetrieve. Like OAI, each operation is qualified with one or more additional name/value pairs.
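For example, the same searchRetrieve request can be expressed as an SRU URL or wrapped in an SRW SOAP envelope. The envelope below is a sketch intended to illustrate the general pattern, not a normative SRW message:

- http://example.org/?operation=searchRetrieve&query=dog&version=1.1

<SOAP:Envelope xmlns:SOAP='http://schemas.xmlsoap.org/soap/envelope/'>
  <SOAP:Body>
    <searchRetrieveRequest xmlns='http://www.loc.gov/zing/srw/'>
      <version>1.1</version>
      <query>dog</query>
    </searchRetrieveRequest>
  </SOAP:Body>
</SOAP:Envelope>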
Explain
Explain operations are requests sent by clients as a way of learning about the server's database/index as well as its functionality. At a minimum, responses to explain operations return the location of the database, a description of what the database contains, and what features of the protocol the server supports. In SRU, a URL with an empty query string is interpreted as an explain operation. When the operation is explicitly stated, a version parameter must be present as well. Therefore, at a minimum, explain operations can be implemented in one of two ways:
- http://example.org/
- http://example.org/?operation=explain&version=1.1
An example SRU response from an explain operation might look like the text below. It denotes:
- the server supports version 1.1 of the protocol
- records in the response conform to a specific DTD and are packed as XML within the stream
- the server can be found at a specific location
- the database is (very) briefly described
- the database supports only title searching
- the database returns records using Dublin Core
- the system will return no more than 9999 records at one time
<explainResponse>
  <version>1.1</version>
  <record>
    <recordSchema>http://explain.z3950.org/dtd/2.0/</recordSchema>
    <recordPacking>xml</recordPacking>
    <recordData>
      <explain>
        <serverInfo>
          <host>example.org</host>
          <port>80</port>
          <database>/</database>
        </serverInfo>
        <databaseInfo>
          <title>An example SRU service</title>
          <description lang='en' primary='true'>
            This is an example SRU service.
          </description>
        </databaseInfo>
        <indexInfo>
          <set identifier='info:srw/cql-context-set/1/dc-v1.1' name='dc' />
          <index>
            <title>title</title>
            <map>
              <name set='dc'>title</name>
            </map>
          </index>
        </indexInfo>
        <schemaInfo>
          <schema identifier='info:srw/schema/1/dc-v1.1' sort='false' name='dc'>
            <title>Dublin Core</title>
          </schema>
        </schemaInfo>
        <configInfo>
          <default type='numberOfRecords'>9999</default>
        </configInfo>
      </explain>
    </recordData>
  </record>
</explainResponse>
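Because SRU requests are just URLs, a rudimentary client is only a few lines of code. Below is a minimal sketch, assuming the hypothetical server above and the standard LWP::Simple module:

#!/usr/bin/perl

# fetch an SRU explain response and print the raw XML;
# example.org is a placeholder, not a real service

use strict;
use LWP::Simple;

my $url = 'http://example.org/?operation=explain&version=1.1';
my $xml = get($url) or die "Unable to fetch $url";
print $xml;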
Scan
Scan operations list and enumerate the terms found in the remote database's index. Clients send scan requests and servers return lists of terms. The process is akin to using a back-of-the-book index: a person looks up a term and 'scans' the entries surrounding it. At a minimum, scan operations must include a scan clause (scanClause) and a version number parameter. The scan clause contains the term to look for in the index. A rudimentary request and response follow:
- http://example.org/?operation=scan&scanClause=dog&version=1.1
<scanResponse>
  <version>1.1</version>
  <terms>
    <term>
      <value>doesn't</value>
      <numberOfRecords>1</numberOfRecords>
    </term>
    <term>
      <value>dog</value>
      <numberOfRecords>1</numberOfRecords>
    </term>
    <term>
      <value>dogs</value>
      <numberOfRecords>2</numberOfRecords>
    </term>
  </terms>
</scanResponse>
SearchRetrieve
SearchRetrieve operations are the heart of the matter. They provide the means to query the remote database and return search results. Queries must be articulated using the Common Query Language (CQL) [2]. These queries can range from simple free text searches to complex Boolean operations with nested queries and proximity qualifications.
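For illustration, CQL queries might look like the following; exactly which relations and modifiers are honoured depends on the server:

dog                                  (a simple free text search)
dc.title = "dog world"               (a title search using the dc context set)
dog and (cat or bird)                (Boolean logic with nesting)
dog prox/unit=word/distance<3 cat    (a proximity search)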
Servers do not have to implement every aspect of CQL, but they have to know how to return diagnostic messages when something is requested but not supported. The results of searchRetrieve operations can be returned in any number of formats, as specified via explain operations. Examples might include structured but plain text streams or data marked up in XML vocabularies such as Dublin Core, MARCXML, MODS (Metadata Object Description Schema), etc. Below is a simple request for documents matching the free text query 'dog':
- http://example.org/?operation=searchRetrieve&query=dog&version=1.1
In this case, the server returns three (3) hits and by default includes Dublin Core title and identifier elements. The record itself is marked up in some flavour of XML as opposed to being encapsulated as a string embedded in the XML:
<searchRetrieveResponse>
  <version>1.1</version>
  <numberOfRecords>3</numberOfRecords>
  <records>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
        <dc>
          <title>The bottom dog</title>
          <identifier>http://example.org/bottom.html</identifier>
        </dc>
      </recordData>
    </record>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
        <dc>
          <title>Dog world</title>
          <identifier>http://example.org/dog.html</identifier>
        </dc>
      </recordData>
    </record>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
        <dc>
          <title>My Life as a Dog</title>
          <identifier>http://example.org/my.html</identifier>
        </dc>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>
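Since the response is plain XML, a client can extract the interesting bits with any XML parser. Here is a minimal sketch assuming the hypothetical request above and the XML::LibXML module:

#!/usr/bin/perl

# fetch a searchRetrieve response and print title/identifier pairs;
# example.org is a placeholder, not a real service

use strict;
use LWP::Simple;
use XML::LibXML;

my $url = 'http://example.org/?operation=searchRetrieve&query=dog&version=1.1';
my $xml = get($url) or die "Unable to fetch $url";

# parse the response and loop through each record
my $doc = XML::LibXML->new->parse_string($xml);
foreach my $record ($doc->findnodes('//record')) {
    print $record->findvalue('.//title'), "\t",
          $record->findvalue('.//identifier'), "\n";
}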
A Sample Application: Journal Locator
In an attempt to learn more about SRU, the author created a simple SRU interface to an index of journal titles, holdings, and locations. Like many academic libraries, the University Libraries of Notre Dame subscribe to physical and electronic journals. Many of the electronic journals are accessible through aggregated indexes such as EBSCOhost Academic Search Elite. Since the content of these aggregated indexes is in a constant state of flux, it is notoriously difficult to use traditional cataloguing techniques to describe journal holdings. Consequently, the Libraries support the finding of journal titles through their catalogue as well as through tools/services such as SFX (Special Effects) and SerialsSolutions. Unfortunately, when patrons ask the question "Does the library have access to journal...?", they need to consult two indexes: the catalogue and an interface to SFX.
Journal Locator is an example application intended to resolve this problem by combining the holdings in the catalogue with the holdings in SFX into a single search interface. By searching this combined index, patrons are presented with a list of journal titles, holding statements, and locations where the titles can be found. The whole thing is analogous to those large computer printouts created in the early to mid-1980s listing a library's journal holdings. Here is the process for creating Journal Locator:
- dump sets of MARC records encoded as serials from the catalogue
- transform the MARC records into sets of simple XHTML files
- dump sets of SFX records as an XML file
- transform the XML file into more sets of simple XHTML files
- index all the XHTML files
- provide an SRU interface to search the index
Here at Notre Dame we use scripts written against a Perl module called MARC::Record [3] to convert MARC data into XHTML. We use xsltproc [4] to transform XML output from SFX into more XHTML. We use swish-e [5] to index the XHTML, and finally, we use a locally written Perl script to implement an SRU interface to the index. The interface is pretty much the basic vanilla flavour, i.e. supporting only explain and searchRetrieve operations. It returns raw XML with an associated XSLT (Extensible Stylesheet Language Transformations) stylesheet, and consequently the interface assumes the patron is using a relatively modern browser with a built-in XSLT processor. Journal Locator is not a production service.
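As an illustration of the first two steps, the MARC-to-XHTML transformation might be accomplished with something like the sketch below; the file name and the markup are assumptions, not the production code:

#!/usr/bin/perl

# convert a file of MARC serial records into simple XHTML snippets;
# serials.mrc and the markup below are illustrative assumptions

use strict;
use MARC::Batch;

my $batch = MARC::Batch->new('USMARC', 'serials.mrc');
while (my $record = $batch->next) {

    # extract the title (MARC field 245) and emit a tiny XHTML document
    my $title = $record->title;
    print "<html><head><title>$title</title></head>\n",
          "<body><h1>$title</h1></body></html>\n\n";

}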
A rudimentary SRU explain operation returns an explain response. The response is expected to be transformed into an XHTML form by the XSLT stylesheet specified in the output. Queries submitted through the form are sent to the server as SRU searchRetrieve operations. Once the query string of the URL is parsed by the server, the search statement is passed on to a subroutine that searches the index, formats the results, and returns them accordingly.
Here is an abbreviated version of the search subroutine in the Perl script. Notice how it searches the index, initialises the XML output, loops through each search result, closes all the necessary elements, and returns the result:
sub search {

  # get the input
  my ($query, $style) = @_;

  # open the index (assumes: use SWISH::API; $INDEX and $input,
  # a CGI object, are defined elsewhere in the script)
  my $swish = SWISH::API->new($INDEX);

  # create a search object
  my $search = $swish->New_Search_Object;

  # do the work
  my $results = $search->Execute($query);

  # get the number of hits
  my $hits = $results->Hits;

  # begin formatting the response
  my $response = "<?xml version='1.0' ?>\n";
  $response .= "<?xml-stylesheet type='text/xsl' href='$style' ?>\n";
  $response .= "<searchRetrieveResponse>\n";
  $response .= "<version>1.1</version>\n";
  $response .= "<numberOfRecords>$hits</numberOfRecords>\n";

  # check for hits
  if ($hits) {

    # process each found record
    $response .= "<records>\n";
    my $p = 0;
    while (my $record = $results->NextResult) {

      $response .= "<record>\n";
      $response .= "<recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>\n";
      $response .= "<recordPacking>xml</recordPacking>\n";
      $response .= "<recordData>\n";
      $response .= "<dc>\n";
      $response .= "<title>" .
                   &escape_entities($record->Property('title')) .
                   "</title>\n";

      # check for and process uri
      if ($record->Property('url')) {
        $response .= "<identifier>" .
                     &escape_entities($record->Property('url')) .
                     "</identifier>\n";
      }

      # get and process holdings
      my $holding  = $record->Property('holding');
      my @holdings = split (/\|/, $holding);
      foreach my $h (@holdings) {
        $response .= '<coverage>' . &escape_entities($h) . "</coverage>\n";
      }

      # clean up
      $response .= "</dc>\n";
      $response .= "</recordData>\n";
      $response .= "</record>\n";

      # increment the pointer and check
      $p++;
      last if ($input->param('maximumRecords') == $p);

    }

    # close records
    $response .= "</records>\n";

  }

  # close response
  $response .= "</searchRetrieveResponse>\n";

  # return it
  return $response;

}
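The escape_entities subroutine is not shown in the excerpt above. A minimal version, offered here as a guess at what the routine does, simply replaces XML's reserved characters with their entity equivalents:

# an assumed implementation of the helper called by &search;
# escape the characters reserved by XML
sub escape_entities {

  my $text = shift;
  return '' unless defined $text;
  $text =~ s/&/&amp;/g;
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  $text =~ s/'/&apos;/g;
  $text =~ s/"/&quot;/g;
  return $text;

}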
The result is an XML stream looking much like this:
<?xml version='1.0' ?>
<?xml-stylesheet type='text/xsl' href='etc/search.xsl' ?>
<searchRetrieveResponse>
  <version>1.1</version>
  <numberOfRecords>2</numberOfRecords>
  <records>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
        <dc>
          <title>The bottom dog</title>
          <coverage>
            Microforms [Lower Level HESB] General Collection
            Microfilm 3639 v.1:no.1 (1917:Oct. 20)-v.1:no.5 (1917:Nov. 17)
          </coverage>
        </dc>
      </recordData>
    </record>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
        <dc>
          <title>Dog world</title>
          <identifier>
            http://sfx.nd.edu:8889/ndu_local?genre=article&amp;
            sid=ND:ejl_loc&amp;issn=0012-4893
          </identifier>
          <coverage>
            EBSCO MasterFILE Premier:Full Text (Availability: from 1998)
          </coverage>
        </dc>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>
Another nifty feature of SRW/U is the use of 'extra data parameters'. These parameters, always prefixed with x- in an SRU URL, allow implementers to add additional functionality to their applications. The author has used this option to create an x-suggest feature. By turning 'x-suggest' on (x-suggest=1), the system will examine the number of hits returned from a query and attempt to suggest additional searches ready for execution. For example, if the number of hits returned from a search is zero (0), then the application will create alternative searches analogous to Google's popular Did You Mean? service by looking up the user's search terms in a dictionary. If the number of hits is greater than twenty-five (25), then the application will help users limit their search by suggesting alternative searches such as title searches or phrase searches.
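A request taking advantage of the feature might look like this (again, example.org is a placeholder):

- http://example.org/?operation=searchRetrieve&query=dogs&version=1.1&x-suggest=1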
The next steps for Journal Locator are ambiguous. On the one hand, the integrated library system may be able to support this functionality some time soon, but the solution will be expensive and quite likely not exactly what we desire. On the other hand, a locally written solution will cost less in terms of cash outlays, but ongoing support may be an issue. In any event, Journal Locator provided a suitable venue for SRU exploration.
SRW/U and OAI-PMH
SRW/U and OAI-PMH are complementary protocols. They have similar goals, namely, the retrieval of metadata from remote hosts, but each provides functionality that the other does not. Both protocols have similar 'about' functions. SRW/U's explain operation and OAI-PMH's identify verb both return characteristics describing the properties of the remote service.
Both protocols have a sort of "browse" functionality. SRW/U has its scan function and OAI-PMH has ListSets. Scan is like browsing a book's back-of-the-book index. ListSets is similar to reading a book's table of contents.
SRW/U and OAI differ the most when it comes to retrieval. SRW/U provides a much more granular approach (precision) at the expense of constructing complex CQL queries. OAI-PMH is stronger on recall, allowing a person to harvest the sum total of data a repository has to offer using a combination of the ListRecords and GetRecord verbs. This is implemented at the expense of gathering unwanted information.
If a set of data were exposed via SRW/U as well as OAI-PMH, then SRW/U would be the tool to use if a person wanted to extract only the data that interests them, cutting across predefined sets. OAI-PMH would be more apropos if the person wanted to get everything or predefined subsets of the data.
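The difference is visible in the requests themselves. Compare a hypothetical OAI-PMH harvest of everything added since the first of the year with a hypothetical SRU search for a single subject (both hosts are placeholders):

- http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2004-01-01
- http://example.org/?operation=searchRetrieve&query=dc.subject%3Ddogs&version=1.1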
OCKHAM
There is a plan to use SRU as a means to implement an alerting service in an NSF-sponsored grant called OCKHAM [6]. The OCKHAM Project is led by Martin Halbert (Emory University), Ed Fox (Virginia Tech), Jeremy Frumkin (Oregon State), and the author. The goals of OCKHAM are three-fold:
- To articulate and draft a reference model describing digital library services
- To propose a number of light-weight protocols along the lines of OAI-PMH as a means to facilitate digital library services
- To implement a select number of digital library services exemplifying the use of the protocols
OCKHAM proposes a number of initial services to be implemented:
- a registry service
- an alerting service
- a browsing service
- a pathfinder service
- a search service
- a metadata conversion service
- a cataloguing service
Notre Dame is taking the lead in developing the alerting service -- a sort of current awareness application allowing people to be kept abreast of newly available materials on the Internet. This is how the service will work:
1. An institution (read 'library') will create a list of OAI data repositories (URLs) containing information useful to its clientele.
2. These URLs will be fed to an OAI harvester and the harvested information will be centrally stored. Only a limited amount of information will be retained, namely information that is no older than what the hosting institution defines as 'new'.
3. One or more sets of MARC records will be harvested from library catalogues and saved to the central store as well. Again, only lists of 'new' items will be retained. As additional data is harvested, the older data is removed.
4. Users will be given the opportunity to create searches against the centralised store. These searches will be saved on behalf of the user and executed on a regular basis, with the results returned via email, a Web page, an RSS feed, and/or some other format.
5. Repeat.
SRU URLs will be the format of the saved searches outlined in Step 4 above. These URLs will be constructed through an interface allowing users to qualify their searches with things like author names, titles, subject terms, free text terms or phrases, locations, and/or physical formats. By saving user profiles in the form of SRU URLs, patrons will be able to apply their profiles to other SRU-accessible indexes simply by changing the host and path specifications.
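For example, a saved profile might be stored as the first URL below; applying the same profile to a second SRU-accessible index is simply a matter of editing the host and path (both hosts are placeholders):

- http://alert.example.org/sru/?operation=searchRetrieve&query=dc.subject%3Dchemistry&version=1.1
- http://catalog.example.edu/search?operation=searchRetrieve&query=dc.subject%3Dchemistry&version=1.1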
The goal is to promote the use of SRU URLs as a way of interfacing with alerting services as unambiguously and as openly as possible.
Summary
SRW and SRU are "brother and sister" standardised Web Service-based protocols for accomplishing the task of querying Internet-accessible databases and returning search results. If index providers were to expose their services via SRW and/or SRU, then the content of the 'hidden Web' would become more accessible and there would be less of a need to constantly re-invent the interfaces to these indexes.
Acknowledgements
The author would like to thank the people on the ZNG mailing list for their assistance. They were invaluable during the learning process. Special thanks goes to Ralph LeVan of OCLC, who helped clarify the meaning and purpose of XML packing.
References
1. The home page for SRW/U is http://www.loc.gov/z3950/agency/zing/srw/
2. CQL is fully described at http://www.loc.gov/z3950/agency/zing/cql/
3. The home page of MARC::Record is http://marcpm.sourceforge.net/
4. Xsltproc is an application written against two C libraries, libxml and libxslt, described at http://xmlsoft.org/
5. Swish-e is an indexer/search engine application with a C as well as a Perl interface. See: http://swish-e.org/
6. The canonical URL of OCKHAM is http://www.ockham.org/