TECkit Language Reference

This is a partial reference rather than a complete one: it describes only the part of the TECkit language relevant to how Bibledit uses it. It was assembled from, and re-uses, the information provided with the TECkit package. This information is used with permission of the author of TECkit, and of SIL.

Introduction

The TECkit language is built around simple mapping rules where a Unicode character on the left-hand side of the rule is mapped to or from a Unicode character on the right-hand side. From this basic structure, mapping rules can be extended by the use of character sequences rather than single characters on either side; by the addition of contextual constraints (environments) determining when a rule should apply; and by the use of character classes, optional and repeatable elements, grouping and alternation to express more complex patterns to be matched and processed.

The TECkit package, as used in Bibledit, is applied to text-processing operations that deal entirely with Unicode data.

File structure and conventions

A TECkit description file is strictly line-oriented; every statement is confined to a single logical line. To allow long rules to be broken across several lines, for easier editing, TECkit interprets a final backslash (\) as a “continuation character”; however, only quite complex mappings are likely to need rules that cannot readily be expressed in a single source line.
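As an illustrative sketch (the character codes are arbitrary examples), a rule split across two physical lines with a trailing backslash still forms one logical line:

  U+0061 U+0301  <>  \
      U+00E1            ; a + combining acute <> precomposed a-acute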

The semicolon (;) introduces a comment that continues to the end of the (physical) line. TECkit ignores everything following a semicolon, unless it is in a quoted string.

Built-in keywords in the TECkit mapping language are not case-sensitive; the compiler will accept any mixture of upper and lower case. This also applies to Unicode character names (more on these below). However, the names of character classes defined in the file itself are case-sensitive, and must be used in a consistent form (more on character classes below, too).

Where “strings” are called for, these may be either single- or double-quoted. There is no mechanism to “escape” quote marks embedded in the string; therefore, a single-quoted string can contain double-quote characters, and vice versa, but it is not possible to include both single and double quotes in the same quoted string.
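For example (the field values here are illustrative), a single-quoted string may contain double quotes, and a double-quoted string may contain single quotes:

  DescriptiveName 'Rules for "smart" quotes'   ; double quotes inside a single-quoted string
  Copyright       "The author's copyright"     ; a single quote inside a double-quoted string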

Unicode character codes are expressed either numerically or using Unicode character names, converted into unique “identifiers” by replacing spaces and hyphens with underscores. TECkit knows a great many Unicode character names. The preferred numeric form is “U+xxxx”, where xxxx represents four to six hexadecimal digits. Normal decimal or hexadecimal numbers are also permitted.
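For example, the following rules are equivalent; each maps LATIN SMALL LETTER A to GREEK SMALL LETTER ALPHA using a different notation:

  U+0061                >  U+03B1                     ; preferred U+xxxx form
  0x61                  >  0x3B1                      ; plain hexadecimal
  97                    >  945                        ; decimal
  latin_small_letter_a  >  greek_small_letter_alpha   ; Unicode character names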

Characters may also be expressed as quoted literals. If the mapping source is Unicode text, then they may be used only for Unicode character values. It is never legal to use quoted literals on both sides of the mapping.
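As a sketch, a quoted literal can stand in for a numeric code on one side of a rule, but never on both:

  "a"  <>  U+03B1    ; legal: a quoted literal on one side only
  ; "a" <> "α"       ; NOT legal: quoted literals on both sides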

A complete TECkit mapping description consists of a header section followed by one or more mapping passes. The simplest Unicode mapping descriptions will contain just one Unicode pass, but for some complex mappings it may be necessary to perform pre- and/or post-processing such as character reordering in other passes. The LHS code space of each pass must correspond to the RHS code space of the pass before it.

Header information

The mapping file begins with header information, which consists of a number of pieces of information about the encoding and mapping, each specified by a keyword followed by a quoted string:

EncodingName

A name that uniquely identifies this mapping table.

DescriptiveName

A string that describes the mapping.

Version

The version of the mapping description.

Contact

Contact information.

RegistrationAuthority

The organization responsible for the encoding.

RegistrationName

The name and version of the mapping.

Copyright

Copyright information.

Only the encoding name is required.

An alternative form of header should be used for mapping descriptions that do transliterations entirely within Unicode. Instead of EncodingName and DescriptiveName, the following four fields are used:

LHSName

Canonical name of the “source” encoding or left-hand side of the description.

RHSName

Canonical name of the “target” encoding or right-hand side of the description.

LHSDescription

Description for the left-hand side of the mapping.

RHSDescription

Description for the right-hand side of the mapping.
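The header of a Unicode-to-Unicode transliteration might therefore look like this (the field values are illustrative):

  LHSName        "Latin-Example"
  RHSName        "Greek-Example"
  LHSDescription "Latin transliteration of the text"
  RHSDescription "Greek orthography"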

Note that while we sometimes think of the left-hand side of the description as “source” and the right-hand side as “target”, TECkit descriptions and mapping tables are bi-directional, and thus these roles can equally well be exchanged.

Finally, the file header can include “flags” that specify certain features of the encoding for both the left- and right-hand sides of the mapping.

LHSFlags ( list-of-flags )

Features of the LHS encoding.

RHSFlags ( list-of-flags )

Features of the RHS encoding.

For each side of the mapping, zero or more of the following flags can be specified:

ExpectsNFC

Input on this side of the mapping should be in fully-composed form.

ExpectsNFD

Input on this side of the mapping should be in fully-decomposed form.

GeneratesNFC

Output on this side of the mapping is fully-composed.

GeneratesNFD

Output on this side of the mapping is fully-decomposed.

VisualOrder

This side of the mapping deals with visual (rather than logical) text order.

The “expects” flags can be used to specify that Unicode input to this side of the mapping should be normalized before it is presented to the actual mapping rules. By specifying a normalization form for the Unicode side of a mapping description, the author can write mapping rules assuming a particular canonical representation. The TECkit engine will take care of normalizing the input text so that it matches the expectation of the rules.

The “generates” flags allow the mapping author to declare which normalization form will be produced by the mapping rules. However, as it can be difficult to ensure the accuracy of this, TECkit does not “trust” this flag, but always explicitly normalizes the output if requested by the application using the mapping.

A typical example of the header information might be:

  EncodingName          "Bibledit-NdebeleDiglot-2008"
  DescriptiveName       "Simple rules for transliteration"
  Version               "1"
  Contact               "mailto:author@domain.org"
  RegistrationAuthority "Bibledit International Ltd."
  RegistrationName      "Bibledit Ndebele Diglot"
  Copyright             "(c)2008 The Author (released under GPL3)"

  LHSFlags              ()
  RHSFlags              (ExpectsNFD GeneratesNFD)

Mapping passes

The heart of a mapping description is the series of mapping passes that relate characters or sequences on the LHS to those on the RHS. In simple cases there is just one pass.

Each pass begins with a header line that declares the encoding space in which it operates:

  pass( pass-type )

where pass-type is one of:

  Byte
  Unicode
  Byte_Unicode
  Unicode_Byte

As Bibledit only works with UTF-8 encoded data, the pass-type should always be Unicode.

There are also “normalization pass” types that can be used in certain cases. To create a normalization pass, specify pass-type as one of:

  NFC_fwd
  NFD_fwd
  NFC_rev
  NFD_rev
  NFC
  NFD

As the names suggest, these apply the NFC or NFD Unicode normalization forms as part of the forward, reverse, or both processing “pipelines”. Most mappings will not need to include explicit normalization passes, as the ExpectsNFC or ExpectsNFD flag can be used to request pre-normalization of Unicode data before any mapping rules are applied, and applications using TECkit can explicitly request either NFC or NFD data when mapping to Unicode. The only reason to use a normalization pass in a mapping description would be to ensure that data is in a particular normalization form somewhere in the middle of a multi-pass Unicode transduction.
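As a sketch, such a normalization pass simply appears between two Unicode passes:

  pass(Unicode)
  ; first set of mapping rules

  pass(NFD)            ; guarantee decomposed data for the next pass

  pass(Unicode)
  ; these rules can assume fully-decomposed input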

Class definitions

Character classes may be used to make the mapping description more readable and concise; suitable class definitions allow a single rule to express a whole set of related mappings. They are typically used in contextual constraints or as elements of rules that reorder character sequences.

Classes are defined with the UniClass statement:

  UniClass [ name ] = ( unicodeSequence )

Class names, always enclosed in square brackets, are “identifiers” that may contain letters, digits, and the underscore character; they may not begin with a digit. Unlike the keywords of the TECkit language, they are case-sensitive. The Unicode sequence is a space-separated list of character codes, similar to those used in mapping rules (see below), with the addition of a “range” notation: two character codes separated by .. represent the complete set of characters from the first to the second (inclusive).

Some examples:

  UniClass [control] = ( U+0000..U+001f U+007f )
  UniClass [letter] = ( U+0041..U+005a U+0061..U+007a )

Defaults for unmapped characters

In Unicode passes, any characters not explicitly matched by mapping rules will be output unchanged.

Mapping rules

The actual mapping from Unicode to Unicode is expressed as a list of mapping rules. A mapping description actually contains two complete sets of mapping rules: one set that matches characters in the first Unicode text and generates the second, and another set that matches characters in the second text and generates the first. However, in most cases it is simplest to express both mappings at once, using bi-directional rules where either side of the rule can act as “match” with the other being “replacement”.

The general form of a mapping rule is:

  lhsSeq [ / lhsContext ] operator rhsSeq [ / rhsContext ]

Here, operator indicates whether this rule is to be used only when mapping from the left-hand side to the right, from the right-hand side to the left, or (the most common case) in both directions:

  <>   bidirectional mapping rule
  >    unidirectional LHS-to-RHS rule
  <    unidirectional RHS-to-LHS rule
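For example (the character choices are illustrative):

  U+0061  <>  U+03B1    ; a and alpha map to each other
  U+0041  >   U+03B1    ; A maps to alpha, but this rule never maps alpha back to A
  U+0042  <   U+03B2    ; beta maps to B only when mapping RHS-to-LHS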

The lhsSeq and rhsSeq parts of the rule are simple lists of character codes. These may be expressed as decimal numbers or as hexadecimal (prefixed with 0x). In Unicode sequences, characters may also be listed by their Unicode character names as found in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt, with all non-alphanumeric characters in the names converted to underscores; thus, for example, thai_character_ko_kai may be used instead of 0x0E01 to make the mapping description file more self-documenting. The Unicode character names are not case-sensitive.

During the mapping operation, whichever of lhsSeq or rhsSeq corresponds to the input side of the rule can be considered a “match string”, with the other being its “replacement”. The context associated with the match string, if any, acts as a constraint on the application of the rule. Any context associated with the replacement is irrelevant; it would be used when mapping in the other direction.

Character class references may be used in the match and replacement sequences, although for clarity it may be better to list each individual character mapping. If a class is used on the replacement side of a rule, it must correspond to a class on the match side, and the resulting rules will map each character in the match class to the equivalent character in the replacement class. The classes must contain the same number of characters. Item tags, see below, may be used to associate the replacement class item with its corresponding match item; in the absence of such tags, items are matched by position within the match and replacement strings.
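For instance, a pair of matched classes maps corresponding lowercase and uppercase ASCII letters in a single rule (a sketch; items correspond by position):

  UniClass [lower] = ( U+0061..U+007A )
  UniClass [upper] = ( U+0041..U+005A )

  [lower]  <>  [upper]    ; each lowercase letter maps to the uppercase
                          ; letter at the same position in its class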

For contextually constrained mappings, the lhsContext and rhsContext parts of the mapping rule are used. These use a “slash … underscore” notation:

  / preContextSeq _ postContextSeq
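For example, the following illustrative rule applies only between particular neighbours:

  U+0073  >  U+007A  / U+0061 _ U+0062    ; s becomes z only when preceded
                                          ; by a and followed by b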

The match and replace strings and the pre- and post-contexts may be simple sequences of character codes, or may be more complex expressions using the following “regular expression” elements:

  [cls]    match any character from the class cls
  .        match any single character
  #        match beginning or end of input text
  ^item    ‘not item’: match anything except the given item

The 'not item' applies to single items only; negated groups are not supported.

  (...)    grouping (for optionality or repeat counts)
  |        alternation (within group): match either preceding or following sequence
  {a,b}    match preceding item minimum a times, maximum b (0 ≤ a ≤ b ≤ 15)
  ?        match preceding item 0 or 1 times
  *        match preceding item 0 to 15 times
  +        match preceding item 1 to 15 times
  =tag     tag preceding item for match/replacement association
  @tag     duplicate the tagged item (including groups) from LHS

The @tag can only occur on RHS. It is typically used to implement reordering.
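As a hedged sketch (the classes here are invented for illustration), tags on the LHS combined with @tag on the RHS can reorder matched items:

  UniClass [vowel] = ( U+0065 U+0069 )    ; e i  (illustrative)
  UniClass [cons]  = ( U+006B U+0074 )    ; k t  (illustrative)

  [vowel]=v [cons]=c  >  @c @v    ; reorder: the consonant now precedes the vowel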

A couple of notes on the use of regular expressions and context constraints:

Rules are tested from the most to the least specific, where a longer rule (counting the length of context as well as the actual match string) is considered more specific than a shorter one. If there are two equally long rules that could match at a particular place in the input, the first one listed in the mapping description file will be used.

The maximum potential length of any pre-context (considering all repeat counts) in a pass, plus the maximum potential match string, plus the maximum potential post-context, must not exceed 255 characters. Similarly, the maximum output that can be generated from any rule is limited to 255 characters.

Macros

TECkit supports a simple macro facility; this may be used to define symbols that act as “shorthand” for frequently-used fragments of a mapping description, such as character classes that are needed in multiple passes, or sequences used in the context of multiple rules.

A macro is defined with a line of the form:

  Define name <arbitrary TECkit source>

After such a line, wherever the name appears in the description, it is treated as standing for the specified source text.

This is particularly useful when the context is complex, perhaps involving several alternatives or multiple repeatable items; suitably descriptive macro names may also serve to make the mapping description more self-documenting.

Another use for macros is to provide more convenient names for Unicode characters. This can help make mapping descriptions more readable.
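For example (the names are chosen here purely for illustration):

  Define alpha  U+03B1
  Define beta   U+03B2

  U+0061  <>  alpha
  U+0062  <>  beta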

Note that macros must be defined before they are used, including any use in the definition of other macros; thus, it is legitimate to say:

  Define NUL   0x00
  Define DEL   0x7F
  Define ASCII NUL..DEL

But if the definitions are rearranged so that NUL and DEL have not yet been defined when they are used in the definition of ASCII, the result is a compile-time error, even though they are defined later:

  Define ASCII NUL..DEL
  Define NUL   0x00
  Define DEL   0x7F
  UniClass [asc] = (ASCII)

This will generate an error on the UniClass line, because the identifiers NUL and DEL found in the expansion of ASCII will be considered undefined.

Unicode-only mappings

The TECkit system, while targeted primarily at byte/Unicode conversion, is used by Bibledit for Unicode mapping operations. A mapping description need not contain a Byte_Unicode pass at all. If it contains only Unicode passes, both input and output are Unicode data.

Example

A simple example of a TECkit mapping may help you to start writing your own quickly.

  EncodingName "Example"

  pass(Unicode)

  ; This simple example transliterates the Latin characters a and b
  ; to the Greek α and β.
  U+0061  >  U+03B1
  U+0062  >  U+03B2

End of example.