Towards Interoperability of European Language Resources
A core component of the European Union is a common market with a single information space that works with around two dozen national languages and many regional languages. This wide variety of languages presents linguistic barriers that can severely limit the free flow of goods, information and services throughout Europe.
In this article, we provide an overview of the META-NET Network of Excellence [1]. This is an ambitious initiative, consisting of 44 centres from 31 countries in Europe, which aims to increase significantly the number of language technologies that can assist European citizens, by supporting enhanced communication and co-operation across languages. A major outcome of the project will be META-SHARE, a searchable network of repositories that collect resources such as language data, tools and related Web services, covering a large number of European languages. The resources within these repositories are intended to facilitate the development and evaluation of a wide range of new language processing applications and services.
Various new applications can be built by combining existing resources in different ways. This process can be helped greatly by ensuring that individual resources are interoperable, i.e., that they can be combined with little or no configuration. The UIMA (Unstructured Information Management Architecture) framework [2][3] is concerned specifically with ensuring the interoperability of resources, and the U-Compare platform [4][5][6], built on top of UIMA, is designed especially to facilitate the rapid construction and evaluation of natural language processing/text-mining applications using interoperable resources, without the need for any additional programming. U-Compare comes with a library of resources of several different types. As part of META-NET, this library will be extended to cover a number of different European languages. The functionality of U-Compare will also be enhanced to allow the use of multi-lingual and cross-lingual components, such as those that carry out automatic machine translation. By integrating and showcasing the functionality of U-Compare within META-SHARE, it is intended to demonstrate that META-SHARE can serve not only as a useful tool to locate language resources for a range of languages, but also as an integrated environment that allows for rapid prototyping and testing of applications that make use of these resources.
A recent survey requested by the European Commission [7] found that Internet users in Europe often have to resort to services and Web pages in languages other than their own: 57% of users purchase goods and services in languages other than their own. Similarly, 55% of users read or watch content in foreign languages, whilst 35% use another language when writing emails, sending messages or posting content on the Web. In all of these cases, English is by far the most frequently used language, other than the users’ own, when performing activities on the Web.
According to the above statistics, European citizens who cannot speak a foreign language could be considered to be at a distinct disadvantage when it comes to the use of technology. Given the increasingly pervasive nature of technology in our everyday lives, this situation needs to be addressed in order to ensure that European cultural and linguistic diversity is preserved, and that maintaining one’s mother tongue does not become a disadvantage.
META-NET
META-NET is a Network of Excellence created with the aim of responding to issues such as those introduced in the section above. Consisting of 44 research centres from 31 countries, it is dedicated to building the technological foundations of a multi-lingual European information society, in order to provide users of all European languages with equal access to information and knowledge.
The achievement of this aim is dependent on the ready availability of data, tools and services that can perform natural language processing (NLP) and text mining (TM) on a range of European languages. They will form the building blocks for constructing language technology applications that can help European citizens to gain easy access to the information they require. Among these applications will be systems that can automatically translate Web content between multiple languages, semantic search systems that will provide users with fast and efficient access to precisely the information they require, and voice user interfaces that can allow easy access to information and services over the telephone, e.g., booking tickets, etc.
Although NLP and TM technologies are well-developed for English, this cannot be said for all European languages, and in some cases, very sparse resources are currently available. It is for this reason that a concerted, substantial and continent-wide effort in language technology research and engineering is needed in order to make applications such as those outlined above available to all European citizens. It is the main aim of META-NET to stimulate, drive and support such an effort.
META-NET has founded the Multilingual Europe Technology Alliance (META). This brings together a large number of stakeholders (currently over 250 members in 42 countries), including researchers, commercial technology providers, private and corporate language technology users, language professionals and other information society stakeholders. The input, expertise and influence of these stakeholders are intended to drive research in the right direction, in order to ensure the success of the initiative.
As part of the initial work of META, a technology visioning process is taking place. The purpose of this visioning process is to produce ideas and concepts for visionary language technology applications and to provide input for technology forecasts. Three vision groups were set up for this purpose, corresponding to Translation and Localisation, Media and Information Services and Interactive Systems. The outcome of the visioning process will feed into the creation of a strategic research agenda for Europe’s Language Technology landscape. The agenda will contain high-level recommendations, ideas for visionary applications, and suggestions for joint actions to be presented to the EC and national as well as regional bodies.
META-SHARE
One of the major outcomes of META-NET will be the META-SHARE platform, which will be an open distributed facility for the sharing and exchange of resources. It is here that the outcomes of the research effort described above will be made available. META-SHARE will consist of a sustainable network of repositories of language data, tools and related Web services for a large number of European languages, documented with high-quality metadata and aggregated in central inventories, allowing for uniform search and access to resources.
Data may include annotated corpora (to be used for training and/or evaluation of tools), thesauri, dictionaries, etc., whilst tools and services may include both complete applications, e.g., semantic search systems or automatic translation systems, and individual components, e.g., part-of-speech (POS) taggers, syntactic parsers, named entity recognisers, etc., that can be reused, repurposed and combined in different ways to develop and evaluate new products and services.
Work is currently ongoing to prepare an initial set of resources for inclusion in META-SHARE. Research groups from a range of European countries are examining, cleaning, upgrading and documenting their resources in order to ensure the quality of these initial resources. Each group is also responsible for attracting external parties to make their resources available on META-SHARE. The platform will permit different licences to be associated with different resources, with either open or restricted access rights. It will also be possible to make available not only resources that are usable free-of-charge, but also those that require a fee for use. Thanks to this flexibility, it is hoped that a large number and variety of tools will be made available on META-SHARE. This will allow it to grow to become an important component of a language technology marketplace for Human Language Technology (HLT) researchers and developers, language professionals (translators, interpreters, content and software localisation experts, etc.), as well as for industrial players, especially small and medium enterprises, catering for the full development cycle of HLT, from research through to innovative products and services.
Interoperability
As stated above, a major aim of META-NET is to facilitate the development and evaluation of a wide range of new language technology applications, by making readily available the building blocks (e.g., component language processing tools such as tokenisers, sentence splitters, part-of-speech taggers, parsers, etc.) from which such applications can be constructed, through reuse and combination in various ways. The planned functionality of META-SHARE will be a huge asset in this respect, in that, thanks to the central inventories and the high-quality metadata accompanying each resource, detailed and accurate searches may be carried out for many different resources dealing with different languages.
However, even if a set of resources providing the functionality needed to build the required application can be located using the META-SHARE search functionality, a major potential issue remains: the ease with which the resources located can be combined to create the application. Especially if the selected resources have been developed by a range of different groups, one or more of the following compatibility issues may arise:
- Tools may be implemented using different programming languages
- Input/output formats of the tools may be different. For example, a particular part-of-speech tagger may produce plain text output. However, other tools that require part-of-speech tagged data as input (e.g., a syntactic parser) may require this data to be in a different format (e.g., XML).
- Data types required or produced by tools may be incompatible. For example, the syntactic constituent types produced by a particular parser may be different to the ones required by a tool that requires access to parsed data (e.g., a named entity recogniser).
Having to deal with such issues can be both time-consuming and a source of frustration, often requiring program code to be rewritten or extra code produced in order to ensure that data can pass freely and correctly between the different resources used in the application.
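To give a flavour of the extra code that such mismatches can demand, the following minimal sketch converts the plain-text output of a part-of-speech tagger into XML for a downstream tool. Both formats are assumptions made purely for illustration: the input is taken to be space-separated word/TAG pairs, and the element and attribute names (sentence, token, pos) are hypothetical rather than those of any particular tool.

```python
import xml.etree.ElementTree as ET


def tagged_text_to_xml(tagged: str) -> str:
    """Convert 'word/TAG word/TAG' text into a small XML document."""
    sentence = ET.Element("sentence")
    for item in tagged.split():
        # Split on the last slash so that words containing '/' survive.
        word, _, tag = item.rpartition("/")
        token = ET.SubElement(sentence, "token", pos=tag)
        token.text = word
    return ET.tostring(sentence, encoding="unicode")


print(tagged_text_to_xml("The/DT cat/NN sleeps/VBZ"))
```

An adapter of this kind is needed for every pair of tools whose formats differ, which is why the effort grows quickly as more resources are combined.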
At the University of Manchester, one of our major contributions to META-NET will be to promote and facilitate the interoperability of resources that are made available in META-SHARE. By interoperability, we mean that a set of resources can be combined together (i.e., they can ‘talk’ to each other) with little or no configuration required, thus alleviating the potential issues outlined above. In recent years, the issue of interoperability has been receiving increasing attention, as evidenced by efforts such as [8][9][10]. The Unstructured Information Management Architecture (UIMA) provides a flexible and extensible architecture for implementing interoperability, as explained in more detail below. As part of our work, we will encourage the adoption of the UIMA framework by other project partners, and showcase the benefits that this can bring to the rapid prototyping and evaluation of NLP systems, through the use of the U-Compare platform.
U-Compare
U-Compare has been developed by the University of Tokyo, the National Centre for Text Mining (NaCTeM) at the University of Manchester and the University of Colorado, with the main purpose of supporting rapid and flexible construction of NLP applications and easy evaluation of applications against gold-standard-annotated data.
U-Compare provides a graphical user interface (launchable via a single click from the Web site), which allows users to construct NLP applications using simple drag-and-drop actions. Each resource is simply dragged from a library of available components (which may be locally deployed or running as Web services) onto a workflow ‘canvas’, in the required order of execution. Once a complete workflow has been specified, it can be applied to a set of documents at the click of a button. An example of a possible workflow for carrying out named entity recognition is the following:
Tokeniser → POS Tagger → Syntactic Parser → Named Entity Recogniser
A workflow will typically add one or more types of annotations to the documents, corresponding to sentences, tokens, parts of speech, syntactic parses, named entities, etc. Different annotation viewers make it possible to visualise these annotations, including more complex annotation types such as syntactic trees and feature structures. Additional features include the ability to save and load complete workflows, and to import/export both workflows and individual resources. Figure 1 shows the U-Compare graphical user interface, with the library of available components on the right, and the workflow canvas on the left.
Figure 1: U-Compare interface
Special facilities are also provided in U-Compare for evaluating the performance of workflows. With a task such as named entity recognition, it is often the case that tools need to be reconfigured each time a new training corpus or dictionary is released. Such reconfiguration may be due to a different data format or additional information that is present in the new resource. Following reconfiguration, even seemingly insignificant changes in the performance of individual tools could result in a lower performance of the system as a whole. Since performance issues may be a direct consequence of a suboptimal workflow, it is here that the real power of U-Compare becomes apparent, in that several different possible manifestations of a workflow (e.g., with particular individual components substituted for others) may be applied to input texts in parallel. This is illustrated in Figure 2, where each step of the workflow may be undertaken by a range of different tools that perform similar tasks. Provided with such banks of tools, U-Compare will automatically compare the performance of each possible combination of tools against gold-standard (i.e., manually produced) annotations.
Figure 2: A reconfigurable workflow
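The idea of trying every combination of tools can be sketched in a few lines. This is only an illustration of the combinatorics and not a description of how U-Compare is implemented; the tool names and the banks dictionary are hypothetical.

```python
from itertools import product

# Hypothetical banks of interchangeable tools, one bank per workflow step.
banks = {
    "tokeniser": ["TokeniserA", "TokeniserB"],
    "pos_tagger": ["TaggerA", "TaggerB", "TaggerC"],
    "ner": ["RecogniserA"],
}


def candidate_workflows(banks):
    """Return every workflow that picks one tool from each bank, in step order."""
    return [tuple(choice) for choice in product(*banks.values())]


workflows = candidate_workflows(banks)
for workflow in workflows:
    print(" -> ".join(workflow))
```

With two tokenisers, three taggers and one recogniser, six candidate workflows result. The number grows multiplicatively with each bank, which is one reason why automated comparison is attractive.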
Results are reported automatically in terms of performance statistics that measure the correctness of the annotations produced, i.e., precision and recall as well as F-measure, which is the harmonic mean of precision and recall [11]. Figure 3 illustrates the display of workflow evaluation data. On the left are the performance statistics and on the right are the annotations produced by the various tools under evaluation.
Figure 3: Evaluation in U-Compare
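As a reminder of how these statistics are defined, the sketch below computes precision, recall and F-measure for a set of predicted annotations against a gold standard. It assumes that an annotation counts as correct only if its span and label match a gold annotation exactly; the matching criteria actually applied by U-Compare may differ, and the example spans are invented.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F-measure) for two sets of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Annotations as (start, end, label) spans; the values are made up.
gold = {(0, 7, "CHEM"), (12, 19, "CHEM"), (25, 31, "CHEM"), (40, 44, "CHEM")}
predicted = {(0, 7, "CHEM"), (12, 19, "CHEM"), (33, 38, "CHEM")}

precision, recall, f_measure = evaluate(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

Here two of the three predicted annotations are correct (precision 2/3) and two of the four gold annotations are found (recall 1/2), giving an F-measure of 4/7.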
The utility of U-Compare with respect to issues such as those introduced above has recently been demonstrated in the recognition of chemical named entities in scientific texts [12]. This work was concerned with a well-established named entity recogniser for the chemistry domain, Oscar3 [13], whose rigid structure made it difficult to modularise and to adapt to new and emerging trends in annotation and corpora. Oscar3 was refactored into a number of separate, reconfigurable U-Compare components, and experiments showed that the substitution of a new tokeniser into the workflow could improve performance over the original system.
The Unstructured Information Management Architecture (UIMA)
The ease with which workflows can be built in U-Compare is dependent on the resources in its library being interoperable. This is mainly achieved and enforced by the fact that U-Compare is built on top of UIMA [3].
The main way in which UIMA achieves interoperability is by virtue of a standard means of communication between resources. Each resource to be used within the UIMA framework (and hence within U-Compare) must be ‘wrapped’ as a UIMA component. This means that it must be specifically configured to obtain its input by reading appropriate annotations from a common data structure called the Common Analysis System (CAS), which is accessible by all resources in the workflow. Output is produced by writing new annotations to the CAS, or updating existing annotations. For example, a tokeniser may add token annotations to the CAS. A POS tagger may subsequently read token annotations, and add a POS feature to them. This standard mechanism for input/output in UIMA allows components to be combined in flexible ways by alleviating the need to worry about differing input/output formats of different tools. It is only necessary to ensure that the types of annotation required as input by a particular component are present in the CAS at the time of execution of the component.
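The following toy model illustrates this read/write pattern. It is emphatically not the UIMA API: the ToyCas and Annotation classes, the whitespace tokeniser and the three-word lexicon are all simplifications invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class ToyCas:
    """A toy stand-in for the CAS: the document text plus shared annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]


def tokeniser(cas):
    """Write one Token annotation per whitespace-separated word."""
    position = 0
    for word in cas.text.split():
        begin = cas.text.index(word, position)
        position = begin + len(word)
        cas.annotations.append(Annotation("Token", begin, position))


def pos_tagger(cas):
    """Read Token annotations and add a PartOfSpeech feature to each."""
    lexicon = {"the": "DT", "cat": "NN", "sleeps": "VBZ"}  # toy lexicon
    for token in cas.select("Token"):
        word = cas.text[token.begin:token.end].lower()
        token.features["PartOfSpeech"] = lexicon.get(word, "UNK")


cas = ToyCas("The cat sleeps")
for component in (tokeniser, pos_tagger):
    component(cas)

for token in cas.select("Token"):
    print(cas.text[token.begin:token.end], token.features["PartOfSpeech"])
```

Neither component knows anything about the other. Each one depends only on the annotation types present in the shared structure, which is what allows one tokeniser to be swapped for another.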
Given that existing tools have varying formats of input/output, some initial work is required to wrap these tools as UIMA components. This generally involves writing some extra code to carry out the following steps:
- Read appropriate annotations from the CAS
- Convert the UIMA annotations to input format required by the tool
- Execute the tool with the given input
- Convert the output of the tool to UIMA annotations
- Write the UIMA annotations to the CAS
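The five steps above can be sketched as follows. The sketch uses a plain dictionary as a stand-in for the CAS and an invented legacy_tagger function as a stand-in for an existing tool, so none of the names belong to the real UIMA API. It also assumes that tokens contain no whitespace, so that the tool's output aligns one-to-one with its input.

```python
def legacy_tagger(line):
    """Stand-in for an existing tool: plain text in, 'word/TAG' text out."""
    lexicon = {"the": "DT", "cat": "NN", "sleeps": "VBZ"}  # toy lexicon
    return " ".join(f"{w}/{lexicon.get(w.lower(), 'UNK')}" for w in line.split())


def wrapped_tagger(cas):
    """Wrap legacy_tagger so that it reads from and writes to the shared store."""
    # 1. Read the appropriate annotations from the CAS.
    tokens = [a for a in cas["annotations"] if a["type"] == "Token"]
    # 2. Convert the annotations to the input format required by the tool.
    line = " ".join(cas["text"][t["begin"]:t["end"]] for t in tokens)
    # 3. Execute the tool with the given input.
    output = legacy_tagger(line)
    # 4. Convert the output of the tool back into annotation features.
    tags = [item.rpartition("/")[2] for item in output.split()]
    # 5. Write the annotations to the CAS.
    for token, tag in zip(tokens, tags):
        token["PartOfSpeech"] = tag


cas = {
    "text": "The cat sleeps",
    "annotations": [
        {"type": "Token", "begin": 0, "end": 3},
        {"type": "Token", "begin": 4, "end": 7},
        {"type": "Token", "begin": 8, "end": 14},
    ],
}
wrapped_tagger(cas)
print([a["PartOfSpeech"] for a in cas["annotations"]])
```

The format conversion is written once, inside the wrapper, rather than once for every pair of tools that need to exchange data.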
The UIMA framework also deals with another issue of interoperability, in that after resources are wrapped as UIMA components, the original programming language is hidden and thus becomes irrelevant. Writing the UIMA wrapper is fairly straightforward when the resource is implemented in either Java or C++, or if the tool is available as a Web service or as a binary.
Type System
Each UIMA component must declare the types of annotations that it requires as input and produces as output. These annotation types are defined in a separate file, and may be hierarchically structured. For example, a type SemanticAnnotation may contain the subtypes NamedEntityAnnotation and CoreferenceAnnotation. Each annotation type may additionally define features, e.g., the token type may have a PartOfSpeech feature. The file containing a set of annotation types is called a Type System.
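A hierarchy of this kind can be modelled very simply. In the sketch below, the dictionary layout, the root type Annotation and the category feature are assumptions made for illustration; the actual file format used for UIMA type systems is not reproduced here.

```python
# A toy type system: each type names its parent and the features it adds.
TYPE_SYSTEM = {
    "Annotation": {"parent": None, "features": []},
    "Token": {"parent": "Annotation", "features": ["PartOfSpeech"]},
    "SemanticAnnotation": {"parent": "Annotation", "features": []},
    "NamedEntityAnnotation": {"parent": "SemanticAnnotation", "features": ["category"]},
    "CoreferenceAnnotation": {"parent": "SemanticAnnotation", "features": []},
}


def ancestors(type_name):
    """Return the chain from type_name up to the root, nearest first."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = TYPE_SYSTEM[type_name]["parent"]
    return chain


def is_subtype(type_name, candidate_parent):
    """True if type_name is candidate_parent or one of its descendants."""
    return candidate_parent in ancestors(type_name)


def all_features(type_name):
    """Collect features declared on the type and inherited from its ancestors."""
    return [f for t in reversed(ancestors(type_name)) for f in TYPE_SYSTEM[t]["features"]]


print(ancestors("NamedEntityAnnotation"))
print(is_subtype("NamedEntityAnnotation", "SemanticAnnotation"))
```

A component that asks for SemanticAnnotation input can then accept any of its subtypes, which is the property that makes a shared, fairly general type system useful.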
Examining the input and output annotation types of different components can help to determine which sets of components can be combined together into workflows, and how, by considering their dependencies. U-Compare makes this process easy, by displaying the input/output types of each component within the graphical user interface.
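The dependency check described here amounts to walking through a proposed workflow and confirming that each component's inputs have already been produced. The sketch below is one way to express that; the component declarations are hypothetical, and for simplicity part-of-speech information is treated as a separate type rather than as a feature of a token.

```python
# Hypothetical declarations: required input types and produced output types.
COMPONENTS = {
    "Tokeniser": {"inputs": set(), "outputs": {"Token"}},
    "POS Tagger": {"inputs": {"Token"}, "outputs": {"PartOfSpeech"}},
    "Syntactic Parser": {"inputs": {"Token", "PartOfSpeech"}, "outputs": {"Parse"}},
    "Named Entity Recogniser": {"inputs": {"Token", "Parse"}, "outputs": {"NamedEntity"}},
}


def missing_inputs(workflow):
    """Return {component: missing types} for every unmet dependency."""
    available = set()
    problems = {}
    for name in workflow:
        missing = COMPONENTS[name]["inputs"] - available
        if missing:
            problems[name] = missing
        available |= COMPONENTS[name]["outputs"]
    return problems


valid = ["Tokeniser", "POS Tagger", "Syntactic Parser", "Named Entity Recogniser"]
print(missing_inputs(valid))
print(missing_inputs(["POS Tagger", "Tokeniser"]))
```

An empty result means that the workflow can run in the given order. Otherwise the result names each component that would start without the annotations it needs.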
Different NLP research groups have produced different repositories of UIMA components, e.g., [14][15]. However, because the UIMA framework itself does not attempt to impose any restrictions or recommendations regarding the use of a particular annotation system, each group generally defines its own type system. This means that components developed by one group cannot be seamlessly combined with, or substituted for, those developed by another group, because their different type systems may use different names or have different hierarchical structures, even though the functionalities of the components may be similar. Such issues can be a major obstacle to interoperability.
Ideally, in order to achieve maximum interoperability, a single, common type system would be imposed, to be followed by all developers of UIMA components. However, this is not considered a viable option, as it would be difficult to achieve consensus on exactly which types should be present, given, for example, the various different syntactic and semantic theories on which different tools are based.
As a partial solution to fostering interoperability between resources developed by different groups, U-Compare has defined a sharable type system, which aims to define the most common types of annotation, both syntactic and semantic, that are produced by NLP applications. The idea is that all components available in U-Compare should produce annotations that are compatible with this type system. As the U-Compare type system consists of fairly general types, it is permissible to create new types that correspond to more specialised types of annotations, as long as these new types can form sub-types of one of the existing U-Compare types. This ensures that compatibility between components developed by different groups can at least be achieved at an intermediate level of the hierarchy.
U-Compare is distributed with over 50 components, constituting what is currently the world’s largest library of type-compatible UIMA components. These components include sentence splitters, tokenisers, POS taggers, abbreviation detectors, named entity recognisers, etc., which currently have a strong bias towards biomedical text processing. All of these components are compatible with the U-Compare type system, providing evidence that conforming to a shared type system to enhance interoperability is a feasible task.
U-Compare has been used in many tasks by both NLP experts and non-expert users, from the individual level to worldwide challenges. They include the BioNLP’09 shared task [16] for the extraction of bio-molecular events (bio-events) that appear in biomedical literature, in which U-Compare served as an official support system, the CoNLL-2010 shared task on the detection of speculation in biomedical texts [17], the BioCreative II.5 challenge [18] of text mining and information extraction systems applied to the biological domain, and linking with Taverna [19], a generic workflow management system.
U-Compare and META-SHARE
As part of META-NET, we want to demonstrate that incorporating U-Compare functionality within the META-SHARE platform would be a huge asset, in order to allow users not only to search for and download individual resources, but also to build and evaluate complete systems in a very simple way.
The current library of U-Compare components deals almost exclusively with the processing of English text, with a handful of Japanese components. In order to illustrate the feasibility of integrating U-Compare within META-SHARE, we plan to expand the U-Compare library to include components corresponding to resources in other European languages. As a starting point for this activity, resources will be provided by project partners covering Portuguese, Spanish, Catalan, Galician, Romanian and Maltese. In keeping with META-NET’s aim to enable enhanced communication and co-operation across languages, cross-lingual components to carry out tasks such as automatic translation should also be made available. In order to allow such components to be integrated into workflows, it will also be necessary to extend the functionality of U-Compare to facilitate the display of multi-lingual annotations.
The creation of UIMA components for several different languages will present an interesting challenge for the U-Compare type system, allowing us to verify to what extent it is language-independent, or whether it is feasible to make it so, and what sort of changes will be necessary. If a language-independent type system is indeed possible, this would certainly increase the feasibility of adopting U-Compare and its type system as a standard to be followed in META-SHARE, and beyond.
In order to showcase the potential and versatility of U-Compare and UIMA in contributing towards the rapid development of the European language technology landscape, our work will include the creation of a set of workflows, ranging from simple to complex, that make use of these new components, both monolingual and cross-lingual. Making such workflows available within META-SHARE will allow them to act as templates for carrying out important language processing tasks, which can be changed or configured according to the requirements of different types of applications. By providing facilities for META-SHARE users to make their own workflows available to other users, and to provide feedback about existing workflows, the process of creating new applications could become even easier.
Conclusion
META-NET is an exciting and challenging initiative which, through the backing and guidance of a large and dedicated community of interested stakeholders, has the potential to make an impact on the lives of all European citizens. This impact will be realised through the increased availability of language technology applications that make it easier to obtain information and knowledge, without the concern of language barriers.
Since many of these language technology applications share common processing steps, such as tokenisation, part-of-speech tagging, syntactic parsing, etc., the ready availability of tools that can carry out these processing steps in different languages and for different types of text is a prerequisite for facilitating a rapid increase in the availability of new applications. The launch of META-SHARE, which will allow numerous repositories of language resources to be queried along multiple criteria from a single point of access, will help to achieve this requirement.
The speed and ease with which new applications can be developed using component language resources is also heavily dependent on the amount of work that must be performed by system developers to allow components to communicate with each other in the correct manner. We have described how, by wrapping resources as UIMA components whose annotation types conform to the U-Compare type system, greater interoperability of the resources, and with it, easier reuse and more flexible combination, can be achieved.
As part of META-NET, we intend to create a significant number of new UIMA components that are compatible with the U-Compare type system for a number of different European languages, together with a range of example workflows that make use of these new components. It is hoped that the planned integration of the U-Compare system within META-SHARE will contribute to a more rapid and straightforward expansion of the European language technology landscape, by allowing users both to run and configure existing workflows and to create new workflows with only a few clicks of the mouse, without the need to write any new program code.
Acknowledgements
The work described here is being funded by the DG INFSO of the European Commission through the ICT Policy Support Programme, Grant agreement no. 270893 (METANET4U).
References
1. META-NET Network of Excellence http://www.meta-net.eu/
2. Ferrucci D, Lally A, Gruhl D, Epstein E, Schor M, Murdock JW, Frenkiel A, Brown EW, Hampp T, Doganata Y, “Towards an Interoperability Standard for Text and Multi-Modal Analytics”. IBM Research Report RC24122 2006.
3. The UIMA framework http://uima.apache.org/
4. Kano Y, Miwa M, Cohen KB, Hunter LE, Ananiadou S, Tsujii J, “U-Compare: A modular NLP workflow construction and evaluation system”. IBM Journal of Research and Development 2011, 55(3):11:1-11:10.
5. Kano Y, Baumgartner WA, Jr., McCrohon L, Ananiadou S, Cohen KB, Hunter L, Tsujii J, “U-Compare: share and compare text mining tools with UIMA”. Bioinformatics 2009, 25(15):1997-1998.
6. U-Compare: a UIMA compliant integrated natural language processing platform and resources http://u-compare.org/
7. User language preferences online: Analytical report, May 2011 (PDF format, 3.62 Mb) http://ec.europa.eu/public_opinion/flash/fl_313_en.pdf
8. Copestake A, Corbett P, Murray-Rust P, Rupp CJ, Siddharthan A, Teufel S, Waldron B, “An architecture for language processing for scientific texts”. In: Proceedings of the UK e-Science All Hands Meeting 2006. 2006.
9. Cunningham DH, Maynard DD, Bontcheva DK, Tablan MV, “GATE: A framework and graphical development environment for robust NLP tools and applications”. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). 2002: 168-175.
10. Laprun C, Fiscus J, Garofolo J, Pajot S, “A practical introduction to ATLAS”. In: Proceedings of the 3rd LREC Conference. 2002: 1928-1932.
11. van Rijsbergen CJ, Information Retrieval. 2nd edition. Butterworth. 1979.
12. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S, “Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry”. PLoS ONE 2011, 6(5):e20181.
13. Corbett P, Murray-Rust P, “High-throughput identification of chemistry in life science texts”. Computational Life Sciences II 2006:107-118.
14. Baumgartner WA, Cohen KB, Hunter L, “An open-source framework for large-scale, flexible evaluation of biomedical text mining systems”. Journal of Biomedical Discovery and Collaboration 2008, 3:1.
15. UIMA Component repository http://uima.lti.cs.cmu.edu
16. Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii J, “Overview of BioNLP’09 Shared Task on Event Extraction”. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task: 2009; 2009: 1-9.
17. Farkas R, Vincze V, Móra G, Csirik J, Szarvas G, “The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text”. In: Proceedings of the Fourteenth Conference on Computational Natural Language Learning—Shared Task. Association for Computational Linguistics; 2010: 1-12.
18. Sætre R, Yoshida K, Miwa M, Matsuzaki T, Kano Y, Tsujii J, “AkaneRE Relation Extraction: Protein Interaction and Normalization in the BioCreAtIvE II. 5 Challenge”. In: Proceedings of BioCreative II 5 Workshop 2009 special session| Digital Annotations. 2009: 33.
19. Kano Y, Dobson P, Nakanishi M, Tsujii J, Ananiadou S, “Text mining meets workflow: linking U-Compare with Taverna”. Bioinformatics 2010, 26(19):2486-2487.
Author Details
Sophia Ananiadou
Professor
School of Computer Science
University of Manchester
and
Director
National Centre for Text Mining
Manchester Interdisciplinary Biocentre
131 Princess St
Manchester M1 7DN
Email: Sophia.Ananiadou@manchester.ac.uk
Web site: http://www.cs.manchester.ac.uk/
Web site: http://www.nactem.ac.uk/
Paul Thompson
Research Associate
National Centre for Text Mining
School of Computer Science
University of Manchester
Manchester Interdisciplinary Biocentre
131 Princess St
Manchester M1 7DN
Email: Paul.Thompson@manchester.ac.uk
Web site: http://www.nactem.ac.uk/
Web site: http://personalpages.manchester.ac.uk/staff/paul.thompson/
Yoshinobu Kano
Project Assistant Professor
Database Center for Life Science (DBCLS)
Faculty of Engineering Bldg. 12
The University of Tokyo
2-11-16, Yayoi, Bunkyo-ku
Tokyo
113-0032, JAPAN
Email: kano@dbcls.rois.ac.jp
Web site: http://u-compare.org/kano/
John McNaught
Deputy Director
National Centre for Text Mining
School of Computer Science
University of Manchester
Manchester Interdisciplinary Biocentre
131 Princess Street
Manchester M1 7DN
UK
Email: John.McNaught@manchester.ac.uk
Web site: http://www.nactem.ac.uk/
Web site: http://www.nactem.ac.uk/profile.php?member=jmcnaught
Teresa K Attwood
Professor
Faculty of Life Sciences & School of Computer Science
Michael Smith Building
The University of Manchester
Oxford Road
Manchester M13 9PT
Email: teresa.k.attwood@manchester.ac.uk
Web site: http://www.bioinf.manchester.ac.uk/dbbrowser/
Philip J R Day
Reader in Quantitative Analytical Genomics
Manchester Interdisciplinary Biocentre
University of Manchester
131 Princess Street
Manchester M1 7DN
Email: philip.j.day@manchester.ac.uk
John Keane
MG Singh Chair in Data Engineering
School of Computer Science
University of Manchester
Manchester Interdisciplinary Biocentre
131 Princess St
Manchester M1 7DN
Email: John.Keane@manchester.ac.uk
Dean Andrew Jackson
Professor of Cell Biology
Faculty of Life Sciences
Michael Smith Building
The University of Manchester
Oxford Road
Manchester M13 9PT
Email: dean.jackson@manchester.ac.uk
Web site: http://www.manchester.ac.uk/research/Dean.jackson/
Steve Pettifer
Senior Lecturer
School of Computer Science
Kilburn Building
The University of Manchester
Oxford Road
Manchester M13 9PL
Email: steve.pettifer@manchester.ac.uk
Web site: http://aig.cs.man.ac.uk/people/srp/