Learning to YODL: Building York's Digital Library
An overview of the first phase of developing a digital repository for multimedia resources at the University of York has recently been outlined by Elizabeth Harbord and Julie Allinson in Ariadne [1]. This article aims to provide a technical companion piece, reflecting on a year’s progress in the technical development of the repository infrastructure. As Allinson and Harbord’s earlier article explained, it was decided to build the architecture using Fedora Commons [2] as the underlying repository, with the user interface being provided by Muradora [3]. Fedora Commons is a widely used and stable open source repository architecture with active user and developer communities, whilst Muradora has a much smaller user and developer base and less stable project funding.
As a result, whilst it offered us a rich set of functionality that helped us to get our initial project up and running, the Muradora architecture was less developed and tested than that of Fedora Commons. This meant that there was always some risk involved in coupling any bespoke development too tightly to Muradora, whilst at the same time quite a lot of in-house development was required in order to adapt Muradora to our needs. In fact, funding for the Muradora Project remains uncertain. The first section of this article will reflect in part on how we have managed our implementation and development in light of this, whilst the second and third sections will focus on two particular development areas: developing fine-grained access control integrated with York’s LDAP, and using XForms technology to develop a workflow structure.
Our Setup
We are currently running Fedora 2.2.4 against the external Tomcat 6 which comes packaged with the Muradora 1.3.3 all-in-one installation package, on Java 5 and Solaris Unix. Our Fedora installation runs against a separately hosted Oracle database rather than the bundled McKoi database.
Reflections on Mura
Fedora Commons comes with a rich set of APIs but with no user interface of its own. This is in some ways a strength, making it very adaptable and easily integrated into a wide range of other architectures. On the other hand, it means that it cannot be used as an “out-of-the-box and ready-for-use” fully fledged digital library architecture – it needs to be integrated with some sort of user interface to sit on top of it. In our particular case we also needed to implement very fine-grained access control as many of the materials we intend to store carry complex and varied copyright requirements. Muradora does go some way towards providing this, although we found a lot of further bespoke development was required in order to fully meet our access control needs – this will be expanded on in section two.
As we needed to get a working system up and running quickly, we felt that Muradora provided an acceptable short-term solution despite the risks of working with a comparatively unstable project. However, we attempted to minimise these risks wherever possible by avoiding coupling our in-house development too tightly to Muradora, in case we needed to move away from it at a later date. Thus, while our workflow forms use the same XForms technology as Muradora and the same Orbeon Forms XForms engine [4], they are not tied to Muradora and can be used independently within other interfaces.
In practice we have still ended up investing a lot of time and development effort in our Muradora installation. This is partly because, as a less developed architecture, Muradora has proved a little ‘buggy’ and unfinished in parts. We found, for example, during the course of user testing, that apparently identical search types (‘simple’ and ‘advanced’) used in different parts of the interface employ different algorithms. This was very confusing to users, who could not understand why the same search run from different screens would return different results, so we had to spend some time looking under the bonnet and making them consistent. On the plus side, we found the Muradora developers to be very helpful during those periods when they had project time available, and a lot of thanks is due to them.
Customising Muradora
In addition to the work required to iron out bugs and inconsistencies, we have done quite a lot of interface customisation for cosmetic and usability purposes. Inevitably this work is not easily transferable between interfaces and will have to be revisited if we move away from Muradora, or indeed if we upgrade our Muradora version. Changes requested ranged from simple logo and colour changes required to meet University branding requirements, to labelling changes (our test users found many of the default labels unintuitive as they were designed for text e-print resources rather than multimedia). Other requests covered layout changes, more extensive Help pages, and some more complex changes. An example of the latter would be the request to break down very large search result lists not only into simple initial-letter alphabetical divisions, i.e. A, B, C (provided in Muradora), but also into second-letter subdivisions, i.e. Aa, Ab, Ac.
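The second-letter subdivision can be illustrated with a minimal sketch. It is not the Muradora code: the function name and the choice to group on resource titles are assumptions made purely for illustration.

```python
from collections import defaultdict


def subdivide(titles):
    """Group titles by initial letter, then by their first two letters."""
    groups = defaultdict(lambda: defaultdict(list))
    # Sort case-insensitively so that the divisions appear in alphabetical order.
    for title in sorted(titles, key=str.casefold):
        key = title.strip()
        if not key:
            continue  # ignore blank titles
        first = key[0].upper()           # e.g. "A"
        second = key[:2].capitalize()    # e.g. "Aa", "Ab", "Ac"
        groups[first][second].append(title)
    # Convert to plain dictionaries for easier display.
    return {letter: dict(subs) for letter, subs in groups.items()}
```

Calling `subdivide(["Abbey", "acanthus", "Aachen", "Baroque"])` yields an "A" division containing the subdivisions "Aa", "Ab" and "Ac", followed by a "B" division containing "Ba".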
User Testing Process
Many of the interface customisation changes came about as a direct result of feedback from our user testing process. We have used a fairly agile development process and gave a test group of volunteers from our pilot user group access as soon as we had a usable interface up and running. In addition, we took inspiration from an accessibility seminar which some of us attended. One of the methods discussed was the “thinking aloud” method where a small group of users are given a set of tasks to work through using the software to be tested, and are observed whilst they perform these tasks, talking through their thought processes in a “stream of consciousness” manner as they do so. We found a small group of volunteers and set up such a session, which we found extremely useful. We found that with a 1:1 observer:volunteer ratio we were able to pick up on things which were obviously confusing our users and which might not have made it on to a remote testing feedback report.
Stability
While we have found Fedora Commons to be a fairly stable application, our Muradora installation has been less so. In particular we have found that if the application is shut down by any means other than a clean Tomcat shutdown (for example a system crash), the Berkeley DB XML database which it uses to store XACML policies is prone to corruption. There have also been possible memory leak problems which we have so far been unable to pin down. Because the Muradora architecture is quite complex, it can be difficult to locate the origin of problems and we have sometimes resorted to a full rebuild of the indexes and databases in order to restore the system to its normal state.
Keeping Pace with Fedora
One drawback of working with a comparatively small project such as Muradora in combination with a big project like Fedora Commons has been that it has limited our ability to keep pace with the latest changes to Fedora. The smaller project has had fewer resources to maintain continued compatibility with the latest Fedora versions. Thus we began with Muradora 1.3.1 and Fedora Commons 2.2.2, upgrading to Muradora 1.3.3. Although we managed to tweak our Muradora 1.3.3 installation to work with Fedora upgrades up to 2.2.4, Muradora 1.3.3 is not compatible with further Fedora upgrades, and although a beta version of Muradora compatible with Fedora 3 has been produced, it is not a stable release. With Fedora 2.2.4 now approaching the end of its support lifecycle, this leaves us considering our options for moving to Fedora 3, which is a priority.
Access Control
Use Cases
Muradora uses eXtensible Access Control Markup Language (XACML) to describe access control policies, and also provides a graphical user interface from which to set access restrictions on individual resources, from Collection level down to individual images, clips or audio files. However, because of the complex copyright requirements relating to our stored material, the out-of-the-box access control was not sufficiently fine-grained to meet our needs, and we have had to do further bespoke development.
In particular, we needed to be able to restrict access on the basis of user role (public, member of staff, undergraduate student, teacher, administrator, taught postgraduate student) and of membership of specific course modules. As a student or teacher is likely to be a member of multiple course modules, this also implies that it must be possible for a single user to have multiple roles – for example a single Art History student might be enrolled on courses covering Baroque Art, Anglo-Saxon Art, and the Bauhaus. In addition, we needed to be able to place different levels of restriction on different types of stored material belonging to the same resource. For example, in one use case a single stored image may have several ‘datastreams’ as they are known in Fedora Commons, consisting of XML metadata, a small thumbnail version of the image, a larger ‘preview’ image, and an archival quality image. In a typical use case we might want to exclude members of the general public from access to any of the datastreams, allow a university student to access the metadata and view the thumbnail, permit a student on the particular course module to which the image is linked to have access to the preview, but only allow an administrator or teacher to download the archival quality image.
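The image use case above can be expressed as a small lookup, sketched below. This is an illustration only: the real policies are written in XACML, and the role names, the datastream names and the `module:baroque-art` identifier are all hypothetical.

```python
# Hypothetical role names and datastream identifiers, for illustration only.
# Each datastream maps to the set of roles allowed to access it.
IMAGE_POLICY = {
    "metadata": {"student", "teacher", "administrator"},
    "thumbnail": {"student", "teacher", "administrator"},
    "preview": {"module:baroque-art", "teacher", "administrator"},
    "archival": {"teacher", "administrator"},
}


def can_access(user_roles, datastream, policy=IMAGE_POLICY):
    """Return True if any of the user's roles is permitted on the datastream."""
    allowed = policy.get(datastream, set())  # unknown datastreams are denied
    return not allowed.isdisjoint(user_roles)
```

Because a user holds a set of roles rather than a single role, a student enrolled on the linked module, holding `{"student", "module:baroque-art"}`, can see the preview, while a student holding only `{"student"}` is limited to the metadata and thumbnail. A member of the public holding `{"public"}` is refused every datastream.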
LDAP Integration
The access control mechanism of YODL has been implemented by a set of filters [5].
During the authentication phase, two Fedora servlet security filters are configured to access Fedora local users and student credentials from the University LDAP server. If a matching user is found either in the local user definition XML file (fedora-users.xml) or on the LDAP server, the access request is passed through the authentication filter and is then checked by the Authorization filter.
The bespoke Authorization filter interacts with the University LDAP and data warehouse (the data source for information about student modules) in order to access student credentials, their matriculation information and their membership of current and past course modules. The appropriate roles are assigned to authenticated users by combining the modules on which they are enrolled with their personal information, e.g. the department to which the student belongs and the user’s role (undergraduate student, postgraduate taught student, research student, etc.).
On the other hand, role-based access control policies need to be created by editing the permissions for each course collection in order to give access to specified role(s) in the adapted Muradora user interface.
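The role-assignment step can be sketched as follows. This is a minimal illustration rather than the filter itself, which is a Java servlet filter. The record fields (`category`, `department`) and the `dept:` and `module:` prefixes are assumptions, and the LDAP and data warehouse lookups are replaced by plain arguments.

```python
def derive_roles(person, modules):
    """Combine personal attributes and module enrolments into a set of roles.

    `person` stands in for the attributes read from LDAP and `modules` for
    the module codes read from the data warehouse; both shapes are
    hypothetical.
    """
    roles = set()
    if person.get("category"):
        roles.add(person["category"])              # e.g. "undergraduate"
    if person.get("department"):
        roles.add("dept:" + person["department"])  # e.g. "dept:history-of-art"
    # One role per current or past module, so a user may hold many roles.
    roles.update("module:" + code for code in modules)
    return roles
```

The resulting set is what a policy written against a course collection would be matched against, so an Art History undergraduate enrolled on two modules would carry four roles.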
Indexing New Relationship Assertions
We found that the unmodified Muradora interface would not allow us to place the sort of restrictions we needed on individual datastreams, such as the thumbnail version of a stored image rather than its archival quality version. In Muradora, restrictions can only be placed on objects indexed in the RIsearch engine. This does not by default include datastream labels, although it does include MIME types. We found a temporary workaround by listing incorrect MIME types as a means of distinguishing between different image versions; however, this was clearly unsatisfactory. As a long-term solution we worked in partnership with a team of external developers, Steve Bayliss and Martin Dow at Acuity Unlimited [6], to implement new functionality developed by them which enables custom assertions to be made about individual datastreams and indexed by RIsearch, thus allowing us to make the assertion that datastream X has label Y and subsequently place access restrictions on those datastreams for which this assertion is true. The functionality involves creating a new XML datastream, named RELS-INT, for each resource to list such assertions. Some minor modifications are then made to the configuration files of Muradora and Melcoe-PDP (the Web service which accesses the XACML policies in Muradora) in order to make these assertions available as criteria in the user interface. The patch developed by the Acuity developers in order to do this has since been incorporated into the latest version of Fedora Commons [7][8].
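To give a feel for what such a datastream contains, the sketch below builds a small RELS-INT-style RDF/XML document asserting a label for each datastream of one object. The `info:fedora/PID/DSID` form for datastream subjects follows Fedora’s convention, but the predicate namespace and the `hasLabel` property are placeholders invented for this example, not the vocabulary actually used in YODL.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace for the custom "has label" assertion.
LABEL_NS = "http://example.org/yodl/relations#"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("yodl", LABEL_NS)


def build_rels_int(pid, labels):
    """Return RDF/XML asserting a label for each datastream of one object.

    `labels` maps datastream IDs to labels, e.g. {"THUMBNAIL": "thumbnail"}.
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for dsid, label in labels.items():
        # Each datastream is the subject of its own rdf:Description.
        description = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": f"info:fedora/{pid}/{dsid}"},
        )
        ET.SubElement(description, f"{{{LABEL_NS}}}hasLabel").text = label
    return ET.tostring(root, encoding="unicode")
```

Once assertions of this kind are indexed, an access policy can select “all datastreams whose label is thumbnail” without relying on the MIME type.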
Developing this solution however generated a new problem: retrospectively modifying the stored data in order to correct the fudged MIME types and add the appropriate XML assertion datastreams. With over 7000 individual resources already stored in our repository, making each amendment individually was out of the question. Fedora does offer an out-of-the-box batch modification facility capable of modifying data on the fly, but this requires the creation of an additional modification directives file specifying each individual change to be made to each individual resource – a gargantuan task if it has to be done manually. However it has proved possible to use RIsearch to create lists of resources requiring specific changes, and pass these as parameters to Unix shell scripts to create the required modification directives files dynamically, thereby permitting successful bulk data modification on the fly.
Submission Workflows
Muradora has a submission wizard to facilitate the process of creating new digital objects. Users with appropriate permissions can submit a new object in three steps: selecting the parent collection and object content model, uploading or specifying resources, and entering metadata. However, the drawbacks of Muradora’s submission workflow prevent it being used in a production environment. In the current deposit workflow, the depositor has to wait while files upload, as uploading is the step prior to entering metadata. The current workflow is therefore inefficient, especially when uploading large files. Two separate asynchronous processes, one for uploading and processing resources and one for submitting metadata, would be a better choice in terms of efficiency and performance. Bespoke workflows are another requirement for specific user groups. For example, some depositors have agreed to a smaller ‘Preview’ image being made public whilst restricting the full-sized image to University users. A preview image should therefore be generated from the original image when an image is submitted, something which is not implemented in Muradora’s workflow.
In summary, the new workflow should make deposit more efficient and should accept any type of file. File types can be divided into three categories, as shown below:
- Fully supported files (e.g. TIFF/JPEG images, WAV audio files, and ISO CD/DVD images): the corresponding processing for each type of file will be defined individually in the workflow. For example, a TIFF image file will be transformed to a full-size JPEG file, to a preview JPEG file, and to a thumbnail JPEG file. The original TIFF image and all three generated image files will be ingested into Fedora as datastreams.
- Partly supported files (e.g. BMP/PNG images): for these files, a generic processing will be defined for a generic type. For example, GenericImage for any declared partly supported images, GenericAudio for any declared partly supported audio files.
- Unsupported files (e.g. AVI files, for now): for these files, a more generic (‘Generic of generic’) process is defined. For example, when an AVI file is selected, the file will be ingested into Fedora as a datastream under a pre-defined fixed name and a pre-defined thumbnail image will be used for any unsupported file.
As shown in Figure 1, to support the asynchronous deposit process, an ingest server can be used by University-wide users as temporary storage for resources to be ingested into YODL. Depositors can specify resources in various ways, e.g. by selecting resources from a mapped drive of the ingest server on their own PC, by uploading resources from their local drive, or by pointing to a URL as either a ‘redirect’ or an ‘external’ link. All resources will be mapped to a URL and will be ready for ingestion. Based on the content model and editor selected by the depositor, the appropriate XForm [10] will be launched to enter metadata. Currently, a VRA [11] XForm editor has been developed for images, and a customised MODS editor is under development for audio. After an XForm is submitted, the metadata will be saved into Fedora directly. At the same time, an asynchronous process is used to process the pre-prepared resources. A message containing resource details will be sent to the YODL server, where a program runs to process all resources, following the appropriate workflow for each specific resource, and to ingest these resources into the Fedora server as datastreams.
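The separation between saving metadata and processing resources can be sketched with an in-process queue and a worker thread. This is only an analogy for the design described above: in YODL the message is sent to a separate server, and the class name, method names and stored structures here are all hypothetical.

```python
import queue
import threading


class DepositService:
    """Saves metadata at once and processes resources in the background."""

    def __init__(self):
        self.metadata_store = {}  # stands in for metadata saved to Fedora
        self.ingested = []        # stands in for datastreams added to Fedora
        self._jobs = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, object_id, metadata, resource_url):
        """Return as soon as metadata is saved; the resource is queued."""
        self.metadata_store[object_id] = metadata
        self._jobs.put((object_id, resource_url))

    def _run(self):
        while True:
            object_id, resource_url = self._jobs.get()
            try:
                # A real worker would fetch, transform and ingest the file here.
                self.ingested.append((object_id, resource_url))
            finally:
                self._jobs.task_done()

    def wait_until_idle(self):
        """Block until every queued resource has been processed."""
        self._jobs.join()
```

The depositor’s call to `submit` returns without waiting for the resource to be handled, which is the property the asynchronous workflow is designed to provide for large files.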
As YODL is expected to support more and more file types in specific ways, it is desirable that the workflow is reusable when the support of a new file format is required. The ideal development scenario when the workflow is asked to deal with a new file format (MIME type) will be:
- Add new code to deal with the new file format
- Modify related configuration file(s)
- No need to modify any existing code
As shown in Figure 2, the factory design pattern [9, p. 87] is used to maximise the reusability of the existing workflow. A matching factory is used to process each file type; these factories implement the processing logic for that type. For example, TIFFileProcessingFactory defines the processing logic for TIFF images, e.g. transforming the TIFF image to a JPEG image and generating preview and thumbnail images. Currently, only a few file types have their own specific factories, e.g. JPG, TIFF, ISO, and WAV. It is impractical to provide a specific factory for every file type. Therefore, some generic factories are designed to process a general category of files, e.g. GenericImageFileProcessingFactory for generic images including BMP, PNG, and GIF, and GenericAudioFileProcessingFactory for generic audio files such as MP3. In addition, a still more generic factory, GenericFileProcessingFactory, is used to process all other file types.
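The way this arrangement lets a new format be added without touching existing code can be sketched as a registry keyed on MIME type. The class names follow the text, but the registry, the `register` decorator, the MIME type strings and the step descriptions for the generic factories are assumptions made for illustration, and the sketch is in Python rather than the language of the actual implementation.

```python
class FileProcessingFactory:
    """Base class: each subclass describes the processing for one file type."""

    def steps(self):
        raise NotImplementedError


_REGISTRY = {}  # maps MIME type -> factory class


def register(*mime_types):
    """Class decorator that associates a factory with one or more MIME types."""
    def decorator(cls):
        for mime_type in mime_types:
            _REGISTRY[mime_type] = cls
        return cls
    return decorator


@register("image/tiff")
class TIFFileProcessingFactory(FileProcessingFactory):
    def steps(self):
        return ["store original TIFF", "make full-size JPEG",
                "make preview JPEG", "make thumbnail JPEG"]


@register("image/bmp", "image/png", "image/gif")
class GenericImageFileProcessingFactory(FileProcessingFactory):
    def steps(self):
        # Hypothetical steps; the text does not spell out the generic processing.
        return ["store original image", "make thumbnail"]


class GenericFileProcessingFactory(FileProcessingFactory):
    def steps(self):
        return ["store file under fixed name", "attach pre-defined thumbnail"]


def factory_for(mime_type):
    """Pick the registered factory, falling back to the most generic one."""
    return _REGISTRY.get(mime_type, GenericFileProcessingFactory)()
```

Supporting a new format then means writing one new class with a `@register(...)` line; `factory_for` and the existing factories are left unchanged, and any unregistered type, such as `video/x-msvideo`, falls through to GenericFileProcessingFactory.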
Conclusions and Future Work
Our decision to use Muradora as our initial interface provided us with a very rich set of functionality despite the drawbacks discussed earlier. Without it, it is unlikely we would have been able to develop our own interface with an equivalent set of functions within the short time frame planned. However, Muradora’s suitability as a long-term solution is still in question. Since Muradora is Open Source, our possible options include continuing to work with and develop the existing code base, attempting to iron out the bugs and making it fully compatible with the latest Fedora version. Alternatively, we can either look for another existing interface which can be hooked into Fedora, or develop our own interface from scratch, accepting that we will need to continue to work with the older Fedora version in the meantime.
Areas for future development include further work on the access control architecture. A flat list of modules and manual role-by-role application of policies is not scalable. There is a need to apply access control to roles in a more managed and sustainable way. A hierarchical representation of roles would be a clearer and more efficient way to display roles. In addition, a flexible search interface is needed to query the appropriate role-based or collection/object-based policies. Time-limited policies are also needed to manage short-term policies, e.g. policies spanning academic years.
Other development is planned through our JISC-funded YODL-ING Project [12], which proposes to develop a suite of technical enhancements to the Digital Library.
References
- Allinson, J., Harbord, E., “SHERPA to YODL-ING: Digital Mountaineering at York”, July 2009, Ariadne, Issue 60 http://www.ariadne.ac.uk/issue60/allinson-harbord/
- Fedora Commons http://www.fedora.info/
- Muradora http://www.muradora.org/
- Orbeon Forms XForms Engine http://www.orbeon.com/ops/doc/index
- Sun servlet filter http://java.sun.com/products/servlet/Filters.html
- Acuity Unlimited, email: stephen.bayliss@acuityunlimited.co.uk, martin.dow@acuityunlimited.co.uk
- Fedora tracker item for RELS-INT http://www.fedora-commons.org/jira/browse/FCREPO-441
- Fedora relationship documentation, now including RELS-INT http://www.fedora-commons.org/confluence/display/FCR30/Digital+Object+Relationships
- Gamma, E., Helm, R., Johnson, R., Vlissides, J., “Design Patterns: Elements of Reusable Object-Oriented Software” (2001), Addison-Wesley, ISBN: 0-201-63361-2.
- XForms http://www.w3.org/MarkUp/Forms/
- VRA core 4.0 http://www.vraweb.org/projects/vracore4/
- YODL-ING http://www.york.ac.uk/library/electroniclibrary/yorkdigitallibraryyodl/#yodl-ing