Unicode and Historic Scripts
Many digital versions of texts–whether they be the plays of Aeschylus, or stories from this week’s Times–can now be accessed by a worldwide audience, thanks to the Internet and developments in international standards and the computer industry. But while modern newspapers in English and even the Greek plays of Aeschylus can be viewed on the Internet in their original script, reading articles that cite a line of original text in Egyptian hieroglyphs is more problematic, for this script has not yet been included in the international character encoding standard Unicode. Because Egyptian hieroglyphs and several other historic scripts are not yet included in the Unicode Standard (“Unicode” henceforth), reading or writing documents with these scripts can be difficult or even impossible on the computer and over the Internet. This situation has important ramifications for the future of scholarly discourse, teaching, online publishing, and accessing and preserving our cultural heritage generally.
Recent developments seem to suggest that help is on the way: Linear B and Cypriot (early forms of Greek), Ugaritic cuneiform, and Old Italic (used for Etruscan and other languages from ancient Italy) are now included in Unicode 4.0 [1]. Still, over ninety scripts are known to be missing, two-thirds of which are historic scripts, the remainder being modern minority scripts [2]. Since these scripts are lesser-known and are used by smaller groups whose financial and/or political clout may not be considered significant, work to get them included in Unicode is slow and poorly funded. Corporations, which make up the majority of members of the Unicode Consortium, and national bodies are showing diminishing interest in finishing the task of encoding scripts of the world. Hence participation from scholars and support from public and private institutions are critical in order to guarantee electronic accessibility to the historic scripts. Should such scripts be omitted from the Standard, access to texts in these scripts will prove difficult. This brief introduction will describe the Unicode Standard and give an overview of how it is organised, how to get it to work, provide a list of missing scripts and offer suggestions on how to help the effort to get the scripts into Unicode.
Figure 1: Sample of Chalukya script (“Box-Headed” script)
Figure 2: Sample of Balti script
What is Unicode and why is it important?
Unicode is an international character encoding standard. With Unicode, each character of a script receives a unique number that remains the same, regardless of the computer platform, software, or the language. For example, the letter Latin capital letter “A” has the value (or “character code”) U+0041 in Unicode. The value assigned to “A” is typically given in hexadecimal notation and preceded by the notation “U+”. It is this number that the computer stores and uses to refer to “A”; such character codes underlie word-processing (such as MS Word) and Web documents.
In a multi-layered model of text data, character encoding lies on the bottom layer, being just a series of character codes. It is a “plain text” representation, since it has no formatting information, such as font size, bold style, etc., which are found in “rich text” or “fancy text.” Above this level of character encoding is markup, such as HTML, XML, or TEI (Text Encoding Initiative), while metadata sits on the topmost layer.
Figure 3: Multi-layered Model of Text Data
Unicode is synchronised with ISO/IEC 10646, the parallel International Standard that is maintained by the International Organization for Standardization (ISO). Because Unicode is widely supported by the computer industry and national bodies, text data is accessible across platforms and computer programs and will remain stable through time.
One goal of Unicode is to cover all the scripts of the world, both historic and modern; it has enough space to cover over one million characters. Attaining universal coverage of the world’s scripts will help users be able to access and use any script in email, Web pages, electronic versions of documents, etc. With the release of Unicode 4.0, over 96,000 characters are encoded, covering a large number of scripts and their languages [3]. But sending email or documents across computer platforms in scripts currently missing from the Standard can cause problems. Since there is no standardised assignment of codepoints, problems can occur when transmitting data (i.e., my “A” could appear as an “F” on another user’s computer screen). Ideally, the missing scripts should be included in Unicode.
Organisation of Unicode and Underlying Concepts
Familiarity with the organisation of Unicode and its underlying concepts are key to being able to take full advantage of Unicode, and will help to avoid frustration and confusion. A brief overview is provided here; the best place to go for information is to the Unicode Standard itself, either in its print version (The Unicode Standard 4.0), or in the PDF version posted on the Unicode Consortium Web site [4]. The Unicode Consortium Web site additionally includes technical reports and annexes, as well as a wealth of other useful information for those new to Unicode and for more advanced users. Because the Web site includes the latest information on recently approved characters and scripts, it should be consulted first, and references below are hence made to the Web pages.
In Unicode, characters are organized into blocks of scripts (“Greek” and “Cyrillic”) or groups of similar characters (i.e., arrows). A listing of the various script blocks is included on the “Code charts” page of the Web site [5]. Some characters may be in different blocks, so users may need to look around for appropriate characters. Several scripts are contained in more than one block. For example, the Latin script, which is used for a wide variety of languages, is covered by five different blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Latin Extended Additional. Punctuation, which is used across a variety of scripts, can appear in its own block.
Each code chart is followed by a list of names of the characters, their codepoints, a representative picture of the character (“glyph”), and often additional information, such as alternative names, cross references to other characters, etc. Background information on the various scripts is included in the first part of the book, with some implementation and usage guidelines.
While looking through the code charts, a number of important concepts need to be taken into consideration:
- Unicode encodes characters, not glyphs. Characters are abstract and reflect “the smallest components of written language that have semantic value” (The Unicode Standard 4.0, p. 15), whereas glyphs are the surface representations of characters. Glyphs appear on the printed page or on your monitor. Unicode is concerned with the abstract characters (such as the lower case letter Latin “a”), the font provides the glyphs (a cursive “a”, a Times New Roman “a”, etc.) Determining a character versus a glyph can be difficult when working with historic texts, because the entire set of characters may not be known or there can be controversy on the details of a script.
Figure 4: Unicode’s domain vs. the font’s domain
- Some letters or symbols in a script may be created by more than one character. Unicode offers a productive means of composing characters by using combinations of a base letter and a combining diacritic (for example, if one needed an “M” with a macron above it, it is covered by the capital letter “M” U+004D and a combining macron, U+0304).
- The code charts reflect only representative images (glyphs) of the letter or symbol, and these are not intended to be definitive. Even though the picture in the code chart may not be identical to the one needed, the place to go for the desired glyph shape is in the font, not in Unicode.
- Abbreviations, ligatures, variants, and idiosyncratic scribal marks are not likely to be included in Unicode, nor are precomposed characters (characters that can be de-composed into more than one character).
Further details on finding a character, are provided on the Unicode Consortium “Where is my Character?” Web page [6].
Scripts that are in the process of being proposed are also listed on the Unicode Consortium Web site [7]. Scripts that have not yet been formally proposed or are without a proposal, are listed on the “Roadmaps” Web page on the Unicode Consortium Web site [8]. It is important to check Planes 0, 1, and 2. Scripts in red are missing detailed proposals, those in blue have proposals submitted to one of the two international standards bodies, the Unicode Technical Committee or the Working Group 2 (WG2) of the ISO/IEC Joint Technical Committee 1 Subcommittee 2 (JTC1/SC2). (Both must ultimately approve the proposal).
Practical Issues: Getting Unicode to Work
Having a script in Unicode does not guarantee that its characters can automatically be used in an email message or a word processing program. Unicode is the underlying standard upon which fonts, keyboards, and software are based, so in order to be able to type and send documents with these scripts and their characters, users need to have Unicode-compliant products. Because the Standard is continually evolving with more scripts and characters being added, using the most recent fonts and stable software available will, in general, provide better support. A listing of such products is posted on the Unicode Enabled Products Web page [9] and on Alan Wood’s Unicode Resources Web site [10].
In general, the following are needed:
- A recent operating system (Mac OS X; Windows 2000, XP; Gnu/Linux with glibc 2.2.2 or newer)
- A recent browser (IE, Safari, OmniWeb, Mozilla/Netscape)
- A Unicode text editor (Word 2000, 2002, Unipad, Apple “TextEdit”)
- An input mechanism (a keyboard, the “insert symbol” mechanism, Apple’s Character Palette)
- A Unicode-enabled font (Arial Unicode MS, Lucida Sans Unicode, Code 2000) (Note: Fonts advertised as “Unicode-compliant” may only cover certain scripts or parts of the repertoire, so users should be cautious)
If a particular script is not in Unicode, a number of short-term solutions are available, including transliterations or transcriptions in an already encoded script. Further options are the creation of non-standard fonts or the use of images (i.e. GIFs). Again, such measures do not provide a long-term solution for creating access to the original script.
Status of Historic Scripts Missing in Unicode
Historic scripts missing in Unicode are listed in the chart below. “Historic” refers to scripts that are extinct or reserved for liturgical use. A single asterisk (*) indicates a proposal has been written for this script (but additional work needs to be done), two asterisks () indicates the proposal has been approved by one of the two standardising bodies. Links to the extant proposals is provided on the Script Encoding Initiative Web site [2]. There are often characters missing from scripts which have been already encoded (i.e., missing alchemical signs, Tibetan punctuation marks, etc.). Work on these is ongoing; the list below refers to entire scripts that are missing.
Missing Historic Scripts
Ahom | Glagolitic | Luwian | Phoenician |
How to Help
The door for getting scripts into Unicode 5.0 will remain open for perhaps one more year, until 2004. After this point it will be increasingly difficult to get scripts into the Standard because of decreased interest in encoding the ‘lesser-known’ and ‘lesser-used’ scripts by large corporations and various national bodies. Hence, it is incumbent upon the digital library community, scholars, and other interested users to become involved early and support this effort. Scripts left out of the Standard will mean that documents that use these scripts–many of which make up our cultural and literary heritage–will be difficult to access in the future.
The most pressing needs in the effort to encode missing scripts are publicity, particularly geared toward public and private agencies, and funding so that veteran script proposal authors can write Unicode proposals and work on the development of free fonts. In response to these needs, I began the Script Encoding Initiative at UC Berkeley [11]. SEI was established to raise funds for script encoding proposal authors (including the world’s foremost script encoding author, Michael Everson) to work on proposals, and to develop fonts. The project has Unicode Technical Directors on its advisory board to help assure that the script proposals move smoothly through the proposal process.
Participation of scholars in the Unicode process is also needed. Proposals should be reviewed by scholars who are familiar with the scripts. A set of “best practices” guidelines for using Unicode with particular scripts is also a desideratum. Moreover, letters of support from the academic field should be sent to the standards bodies, verifying the final proposals are sound and that the encoding for a script is necessary.
Figure 5: Sample of Rongo Rongo script
Conclusion
The Unicode Standard represents a tremendous advance for multilingual computing, but more work is needed in order to achieve a truly multilingual Web. The impact will be manifold: it will help in the publication and preservation of materials in ancient and historic scripts, facilitate the online teaching of languages using such scripts, and provide universal access to our cultural and literary heritage.
Acknowledgements
All script samples were kindly provided by Michael Everson of Everson Typography.
Note: This article has drawn heavily upon a paper given 25 November 2002 at a panel co-sponsored by the Oriental Institute, University of Chicago, and the Society of Biblical Literature on “Electronic Markup and Publication of Ancient Near Eastern Texts” in Toronto, Canada.
References
- Historic scripts encoded in Version 4.0 of the Unicode Standard, (PDF file excerpt)
http://www.unicode.org/versions/Unicode4.0.0/ch13.pdf - Alphabetical list of scripts missing from Unicode, Scripts Encoding Initiative. http://linguistics.berkeley.edu/~dwanders/alpha-script-list.html
- Unicode Consortium languages and scripts page
http://www.unicode.org/onlinedat/languages-scripts.html - Unicode Consortium 4.0.0 Web page
http://www.unicode.org/versions/Unicode4.0.0/ - Unicode Consortium code charts page
http://www.unicode.org/charts/ - Unicode Consortium “Where is my Character?” Web page
http://www.unicode.org/unicode/standard/where/ - Unicode Consortium Proposed New Scripts Web Page
http://www.unicode.org/pending/pending.html - Unicode Consortium Roadmaps page
http://www.unicode.org/roadmaps/ - Unicode Consortium Enabled Products Web page
http://www.unicode.org/onlinedat/products.html - Alan Wood’s Unicode Resources Web site
http://www.alanwood.net/unicode/ - Script Encoding Initiative homepage
http://linguistics.berkeley.edu/~dwanders/
Author Details
Deborah Anderson
Researcher, Dept. of Linguistics
University of California, Berkeley
California,
USA.
Email: dwanders@socrates.berkeley.edu
Web site: http://www.linguistics.berkeley.edu/~dwanders
Article Title: “Unicode and Historic Scripts”
Author: Deborah Anderson
Publication Date: 30-October-2003
Publication: Ariadne Issue 37
Originating URL: http://www.ariadne.ac.uk/issue37/anderson/