ARROW, DART and ARCHER: A Quiver Full of Research Repository and Related Projects

David Groenewegen; Andrew Treloar

Andrew Treloar and David Groenewegen describe three inter-related projects to support scholarly outputs and the e-research life cycle which have been funded by the Australian Commonwealth Government.

This paper describes three inter-related repository projects. These projects were all funded by the Australian Commonwealth Government through the Systemic Infrastructure Initiative as part of the Commonwealth Government’s Backing Australia’s Ability - An Innovation Action Plan for the Future. The article will describe the background to all three projects and the way in which their development has been inter-related and co-ordinated. The article will conclude by examining how Monash University (the lead institution in all three projects) is re-conceiving the relationship between its different repositories.

ARROW

The Australian Research Repositories Online to the World (ARROW) Project came into existence in response to a call for proposals issued in June 2003 by the Australian Commonwealth Department of Education, Science and Training (DEST). DEST was interested in furthering the discovery, creation, management and dissemination of Australian research information in a digital environment. Specifically, it wanted to fund proposals that would help promote Australian research output and build the Australian research information infrastructure through the development of distributed digital repositories and the common technical services supporting access and authorisation to them.

In response to this, a consortium consisting of Monash University (lead institution), University of New South Wales, Swinburne University and the National Library of Australia submitted a bid and was successful in attracting $A3.66M over three years (2004-6), with follow-up funding in 2007 of $A4.5M for ARROW2 and a sub-project called Persistent Identifier Linking Infrastructure (PILIN).

Design

The design of the ARROW repository solution was informed by the desire to:

use a common underlying repository for a range of content types
provide content management modules for different use cases
expose the content as widely as possible using a number of different technologies

The resulting high-level design is shown in Figure 1.

Figure 1: ARROW High-level Architecture

Development

After careful analysis of the available candidates at the time, the ARROW Project decided to use the Fedora Open Source software [1] as the foundation of the repository [2].

Fedora provides an underlying engine, but at the time offered little in the way of software to run on top of this engine, so ARROW needed to build this separately. One of the requirements from the funding agency was that the project address the sustainability of any solution after the project funding concluded. The ARROW Project therefore decided to collaborate with a commercial software developer to produce much of the ARROW software. The developer, VTLS Inc [3], already had a basic digital image collection management tool on the market called VITAL that was built on top of Fedora. ARROW Consortium members licensed VITAL 1.0 and then worked with VTLS to extend the functionality of both VITAL and Fedora. In the former case, this was through working with VTLS to specify a series of VITAL releases (1.1, 1.2, 2.0, 2.1 and 3.0). In the latter case, ARROW commissioned VTLS to produce a series of open-source modules to either extend or complement Fedora. This unusual partnership has been generally successful, with significant sharing of intellectual property. At the end of the ARROW Project, those members who have adopted VITAL will have a normal commercial relationship with VTLS, and will receive both support and successive versions of VITAL as long as they continue to pay software maintenance.

The selected search and exposure services were the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [4], Search/Retrieve via URL and Search/Retrieve Web Service (SRU/SRW) [5], Web spidering, and a native Web access portal. SRU/SRW support was one of the open-source components written by VTLS. In addition to providing support for these search and exposure protocols, the National Library of Australia (as part of its contribution to the ARROW consortium) has developed a National Discovery Service [6] that harvests metadata from ARROW, DSpace and ePrints repositories within Australia.
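
To make the harvesting step more concrete, the following is a minimal sketch in Python of how a harvester might talk to an OAI-PMH endpoint. It is not the National Discovery Service's code. The verb ListRecords, the metadataPrefix and resumptionToken parameters, and the XML namespaces come from the OAI-PMH specification; the function names, the base URL and the sample response are assumptions made for illustration, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumption token is supplied it is sent on its own, because
    OAI-PMH treats resumptionToken as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        ident = rec.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        title = next((e.text for e in rec.iter(DC_NS + "title")), None)
        records.append((ident, title))
    token = root.findtext(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    return records, (token or None)


# Invented sample response, standing in for what a repository might return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
next_url = list_records_url("http://repository.example.edu/oai",
                            resumption_token=token)
```

A real harvester would fetch each URL in turn, repeating the request with the returned resumption token until the repository stops supplying one.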

Deployment

Since the commencement of the ARROW Project a total of 15 Australian universities (out of a total of 40) have licensed the VITAL software. The emergence of this large community (in addition to overseas licensees) augurs well for the viability of the software solution. More information on ARROW is available [7].

DART

In early 2005, the Australian Government called for a second round of Systemic Infrastructure Initiative proposals for collaborative projects that brought together consortia to improve accessibility to Australian research. This call for proposals identified four areas of interest:

maximising access to digital resources in Australian universities, especially regional universities;
creating new types of digital libraries to manage extremely large datasets;
adopting a national approach to improving open access to the results of publicly funded research;
providing effective linkages between sets of research information to enable seamless access by researchers.

The call for proposals also identified a number of key trends that are changing the ways in which research is conducted and its outputs consumed. These include new technologies, such as computer simulations, synchrotrons and sensor networks, the expanding size of the datasets on which research is based, increasing volumes of information generated through research, greater complexity, and the recognition of the need to work across traditional disciplinary, institutional and national borders. To this one might add a growth in research practices that are producing a paradigm change in the types of research that this new large-scale computing/data management environment can support. These emerging research practices are intensely collaborative (often involving trans-national teams), require high-quality network access, and are data- and simulation-intensive.

In response to this second call, Monash University again took the lead and submitted a bid entitled Dataset Acquisition, Accessibility and Annotation e-Research Technologies (DART). The DART request for funding built on the work already done in the ARROW Project in establishing the basis for institutional research publication repositories, as well as antecedent activity at each of the three DART partners (James Cook University, Monash University and the University of Queensland). It did this by extending its areas of interest to tackle issues associated with large datasets and sensors, as well as annotation technologies and collaborative, composite documents.

The DART bid was successful and received $A3.23M in August 2005, with an original project end date of December 2006 (later extended to June 2007).

Design

Figure 2 shows the high-level design for DART. In the uppermost layer are researchers, readers and computer programs. The middle layer shows the proposed repositories (including traditional publications as research outputs, and raw data) and the data flows between them and the datasets in the lowest layer. The lowest layer shows the data sources and their associated storage. The figure has been annotated to indicate the work packages that are involved for each component. For instance, the process of editing dynamic collaborative documents is described in work package AA4. Details of the work packages can be found on the DART Web site [8], and are briefly outlined below.

Figure 2: DART High-Level Architecture

The DART Project was structured as a number of thematically grouped but inter-related sets of work packages.

In the Data Collection, Monitoring and Quality Assurance (DMQ) theme, DART is tackling the issues surrounding high-rate and large-volume data streams, particularly those generated by instruments and sensors. A number of requirements are unique to handling digital objects generated by, or derived from, instruments and sensors. These include:

two-way communication with the instruments and sensors so that their status and information can be probed and monitored remotely (work packages DMQ1, DMQ2)
implementing a standard approach for detecting faulty or poor-quality data early in the experiment (DMQ3)
providing enhanced security and access to the instruments and sensors (DMQ4)
triggering the download of data contained in temporary data storage (data cache) into the permanent data storage (DMQ5)
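
Two of the requirements above, early detection of poor-quality data (DMQ3) and triggering the move from the data cache into permanent storage (DMQ5), can be sketched together. The Python below is a minimal illustration under stated assumptions, not DART's implementation: the class and field names, the validity rule and the size-based trigger are all hypothetical, and in-memory lists stand in for the cache and the permanent store.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Reading:
    sensor_id: str
    value: float


@dataclass
class CachedCollector:
    """Hypothetical collector: validates readings, caches them, then flushes."""
    flush_threshold: int                      # assumed trigger: cache size
    is_valid: Callable[[Reading], bool]       # assumed quality rule
    cache: List[Reading] = field(default_factory=list)
    permanent: List[Reading] = field(default_factory=list)
    rejected: List[Reading] = field(default_factory=list)

    def ingest(self, reading: Reading) -> None:
        # Reject poor-quality data as early as possible.
        if not self.is_valid(reading):
            self.rejected.append(reading)
            return
        self.cache.append(reading)
        # Trigger the transfer once the temporary store is full.
        if len(self.cache) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Move everything from the cache into permanent storage.
        self.permanent.extend(self.cache)
        self.cache.clear()


collector = CachedCollector(
    flush_threshold=2,
    is_valid=lambda r: 0.0 <= r.value <= 100.0,
)
for value in (12.5, 250.0, 40.0, 41.0):
    collector.ingest(Reading("sensor-a", value))
```

In this run the out-of-range reading is set aside, the first two valid readings are flushed together, and the last one waits in the cache for the next trigger.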

In the Storage and Interoperability (SI) theme, DART is:

working to integrate Fedora and the Storage Resource Broker [9] (SI1, SI2)
semantically augmenting the SRB Metadata Catalogue (MCAT) (SI3)
providing a secure service for transferring data from sensors/instruments to repositories using Grid security (SI4)
developing an abstraction layer that supports a range of data replication systems (SI5)
allowing simulation data to be retrieved from repositories or regenerated dynamically using computational services (SI6)
developing a cost-effective data pre-processing system for the secondary storage (SI7)
piloting long-distance high-speed and secure data transfer between repositories (SI8)
scoping and piloting storage infrastructure requirements (SI9)

In the Contents and Rights (CR) theme, DART is:

developing simple user interfaces, guidelines, and workflows to enable researchers to deposit documents, research data and results easily into institutional repositories (CR1, CR5)
producing tools and services to enable researchers easily to select and attach standardised licenses defining access and re-use rights to their data and research results (CR3)
creating guidelines for information management best practice in research teams, arising from embedding information professionals into such teams as research partners (CR4)
identifying and clarifying legal issues around IP (Intellectual Property), information security and privacy (CR6)

The Annotation and Assessment (AA) theme is looking at how best to support adding value to the contents of repositories post-publication through:

extending and refining existing annotation tools to enable annotation of digital objects held within the Fedora and SRB research repositories (AA1)
enhancing existing tools to support collaborative annotations, thus enabling research communities to document shared practices and assessments (AA2)
development of secure authenticated access to annotation servers (AA3)
investigating the use of Web-based collaborative tools to support distributed teams (AA4)

The final theme, Discovery and Access (DA) aims to improve the accessibility of publications and datasets by:

allowing the creators of digital objects better to control end-user access, thus reducing their reluctance to contribute (DA1)
developing a portal to provide seamless search interfaces across distributed archives implemented in SRB and Fedora (DA2)
developing and providing access to a centralised repository/registry of metadata schemas and ontologies (DA3)
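
The idea behind the DA2 portal, one query sent to several archives with the results merged into a single list, can be illustrated with a short Python sketch. It makes no claim about how the DART portal is built: the two search functions are in-memory stand-ins for Fedora and SRB back ends, and every name and title in the example is invented.

```python
def search_fedora(query):
    """Stand-in for a Fedora-backed archive (invented holdings)."""
    titles = ["Protein crystal structure dataset", "Thesis on climate models"]
    return [t for t in titles if query.lower() in t.lower()]


def search_srb(query):
    """Stand-in for an SRB-backed archive (invented holdings)."""
    titles = ["Climate sensor readings 2006", "Crystal diffraction images"]
    return [t for t in titles if query.lower() in t.lower()]


def federated_search(query, archives):
    """Send one query to every archive and merge the hits into one list.

    Each hit records which archive it came from, so the user sees a
    single result list without losing provenance.
    """
    hits = []
    for name, search in archives.items():
        hits.extend({"archive": name, "title": t} for t in search(query))
    return sorted(hits, key=lambda h: h["title"].lower())


ARCHIVES = {"fedora": search_fedora, "srb": search_srb}
crystal_hits = federated_search("crystal", ARCHIVES)
```

Keeping each back end behind a plain function means that further archives could be added to the dictionary without changing the merging logic.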

Demonstrators

DART was designed as a proof-of-concept project. In order to ground the development activity in the needs of real researchers, DART has been working with researchers in three different domains: x-ray crystallography, digital history and climate research. Of these, the x-ray crystallography demonstrator is the one that has progressed to the greatest extent. A Gridsphere [10] portal has been created which supports elements of all the DART themes, and which is being used by one of the lead proteomics research teams at Monash University. In addition to the demonstrators, the outputs of the various work packages are progressively being documented and made available on the DART Web site.

ARCHER

The Australian ResearCH Enabling enviRonment (ARCHER) Project was funded under the third round of Systemic Infrastructure Initiative (SII) funding, receiving $A4.5M. The project effectively commenced in early 2007 and is required to complete by the end of 2007 (a very tight timeframe). ARCHER aims to take the best of the proof-of-concept work from DART and turn it into robust software, ready for deployment. At the time of writing, ARCHER had just completed the refinement of requirements and was engaged in detailed project planning. Further details can be found at the ARCHER Web site [11].

ARCHER will also be working with a number of groups of leading Australian researchers to make sure it meets their needs. These researchers are drawn from a selection of the capability groups under the National Collaborative Research Infrastructure Strategy (NCRIS). This is another Australian Commonwealth Government programme, designed in part to succeed the SII programme and in part to support particular research areas of interest to, or relevance for, Australia. The NCRIS funding programme runs from 2007 to 2011. As the ARCHER funding is only for the first year of the NCRIS timeframe, one of the ARCHER tasks will be to identify a long-term deployment provider who can take over the ARCHER deliverables and provide a service to NCRIS.

Design

The ARCHER design has tried to focus on what is likely to be of benefit to researchers across a range of disciplines. After a careful process of requirements development, this has been narrowed down to the following set of functions:

a component integration framework based on the Gridsphere portal environment and the Kepler [12] workflow engine
a security architecture that supports Shibboleth [13], Public Key Infrastructure and Grid security as required, while trying to maximise the use of Shibboleth
data collection support for sensors, instruments and major national facilities
rich metadata support including extraction and management
a metadata schema/ontology registry
a storage fabric based on the Storage Resource Broker (SRB) that also supports offline work
integration of capability-specific data analysis packages
annotation tools
the ability automatically to publish data to discipline/journal repositories, as well as export it in different formats
search and browse functionality

Figure 3 shows the way in which these components relate to the research process, as well as how ARCHER tools complement those likely to be provided already by institutions at which researchers work.

Figure 3: ARCHER and the research life cycle

Development

ARCHER has four possible strategies for completing this ambitious work schedule:

Assign work to existing clusters of DART activity
- this will probably produce the least delays, but writing robust production quality code is a very different thing to producing proof-of-concept code
Adopt/adapt existing open source components
- this builds on already mature work
Commission work from existing open source developers
- this builds on their existing talent pools and also improves the chances that the result will be integrated into their codebase
Call for bids for components in an open tender process
- while initially attractive, it seems unlikely that the tight time constraints will permit this

In practice, all but the last of these strategies will probably be adopted to varying degrees.

Relationships between the Projects

ARROW was initially envisaged as a total solution for the storage and access of digital materials. Work on the project demonstrated that this was too ambitious, and that building all the necessary tools into a single space was unlikely to be realistic. The DART Project gave us the beginnings of an understanding of the type of tools that would be needed in the collaborative space we originally envisaged. As ARCHER progresses, the linkages between it and ARROW will become stronger. Between them we envisage a ‘curation boundary’ - a software-based workflow that will use human intervention to decide what moves from the collaborative space that ARCHER represents into the ‘publishing’ section that ARROW provides.

This curation boundary will not be rigidly defined, but will be designed to ensure that any material that crosses it adheres to the needs and rules of the area it is entering.
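
One way to picture such a boundary is as a gate that combines automated rule checks with a human sign-off. The Python sketch below is an illustration only, since the boundary had not been specified at the time of writing: the required metadata fields, the approval field and the function names are all assumptions, and a plain list stands in for the publication repository.

```python
from dataclasses import dataclass

# Hypothetical rules of the area being entered (the publication side).
REQUIRED_FIELDS = ("creator", "date", "rights")


@dataclass
class Item:
    title: str
    metadata: dict
    approved_by: str = ""   # stays empty until a human curator signs off


def missing_fields(item):
    """List the required metadata fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not item.metadata.get(name)]


def cross_boundary(item, publication_repo):
    """Move an item into the publication repository if it qualifies.

    Returns a list of outstanding problems; an empty list means the
    item crossed the boundary.
    """
    problems = missing_fields(item)
    if not item.approved_by:
        problems.append("curator approval")
    if problems:
        return problems
    publication_repo.append(item)
    return []


published = []
draft = Item("Diffraction dataset", {"creator": "A. Researcher"})
first_attempt = cross_boundary(draft, published)

# The researcher completes the metadata and a curator approves the item.
draft.metadata.update({"date": "2007-03-01", "rights": "CC BY"})
draft.approved_by = "repository curator"
second_attempt = cross_boundary(draft, published)
```

Returning the list of outstanding problems, rather than a bare yes or no, keeps the boundary permeable: the depositor can see exactly what is needed before the item can cross.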

Conclusion

The original ARROW bid envisaged a single underlying repository that would underpin all the research outputs of a university. The process of moving from ARROW to DART and now ARCHER has suggested an alternative model. This is based on two different kinds of repository: one optimised for collaboration and one for publication. As both ARROW and ARCHER move out of the project phase and into production, these ideas will be further developed, enabling a more mature assessment of the value of this approach. Until then, it is safe to conclude that ARROW is clearly a success in its own terms, DART has made significant advances and ARCHER is showing early promise.

References

1. Fedora Web site http://www.fedora.info/
2. For more background on the reasons for this decision see Treloar, A., “ARROW Targets: Institutional Repositories, Open-Source, and Web Services”, Proceedings of AusWeb05, the Eleventh Australian World Wide Web Conference, Southern Cross University Press, Southern Cross University, July 2005. http://ausweb.scu.edu.au/aw05/papers/refereed/treloar/
3. Visionary Technology in Library Solutions (VTLS) Web site http://www.vtls.com/
4. OAI-PMH Web site http://www.openarchives.org/OAI/openarchivesprotocol.html
5. SRU/SRW Web site http://www.loc.gov/z3950/agency/zing/
6. Australian Research Repositories Online to the World (ARROW) National Discovery Service Web site http://search.arrow.edu.au/
7. Payne, G. and Treloar, A., “The ARROW Project after two years: are we hitting our targets?”, Proceedings of VALA 2006, Melbourne, January 2006.
8. Dataset Acquisition, Accessibility and Annotation e-Research Technologies (DART) Web site http://dart.edu.au/
9. Storage Resource Broker (SRB) Web site http://www.sdsc.edu/srb/
10. Gridsphere portal Web site http://www.gridsphere.org/
11. Australian ResearCH Enabling environment (ARCHER) Web site http://archer.edu.au/
12. Kepler workflow engine Web site http://www.kepler-project.org/
13. Shibboleth security system Web site http://shibboleth.internet2.edu/

Author Details

Dr Andrew Treloar
Director and Chief Architect, ARCHER Project
Monash University
Victoria 3800
Australia

Email: Andrew.Treloar@its.monash.edu.au
Web site: http://andrew.treloar.net/

David Groenewegen
Project Manager, ARROW Project
Monash University
Victoria 3800
Australia

Email: David.Groenewegen@lib.monash.edu.au
Web site: http://arrow.edu.au/
