Enabling Unicode Sinhala (සිංහල) in GNU/Linux HOWTO

Harshula Jayasuriya

<harshula at gmail dot com>

Revision History
Revision 9	2011/08/29

Revision 8	2009/11/26

Revision 7	2008/09/01

Revision 6	2007/05/13

Revision 5	2006/11/20

Revision 4	2006/04/03

Revision 3	2005/03/06

Revision 2	2004/10/07

Revision 1	2004/06/05

Table of Contents

1. About

2. Introduction

2.1. Learn about Sinhala
2.2. Learn about Unicode
2.3. Learn about Standards
2.4. Mailing Lists
2.5. Contributors

3. Enabling Sinhala

3.1. Short Answer
3.2. Long Answer

4. Developer Notes

4.1. Open Type Fonts
4.2. Renderer (Layout Engine)
4.3. Firefox/Mozilla
4.4. Open Office
4.5. Input Methods
4.6. Databases
4.7. Locales
4.8. Translations
4.9. Packaging
4.10. DONE
4.11. TODO

5. Resources

5.1. Input Methods
5.2. Internationalization
5.3. Localization
5.4. Sinhala
5.5. Typography
5.6. Unicode

6. Conclusion

1. About

Sinhala is the language spoken by the majority of Sri Lanka's population. This guide describes the level of Sinhala support available in GNU/Linux. It also describes how to enable improved Sinhala support and the tasks that still require attention.

This guide is GNOME and Debian/Ubuntu centric. Most of the explanations and suggestions should also be applicable to other distributions.

2. Introduction

The level of Sinhala support in GNU/Linux distributions has improved significantly. Most modern distributions contain the required Unicode Sinhala (SLS1134) font and Sinhala input methods (keyboard layouts). Usually it is simply a matter of installing the relevant packages by following the instructions in Section 3.

Debian 5.0 (Lenny), Fedora 10 (Cambridge) and Ubuntu 8.10 (Intrepid) contain Unicode Sinhala support on the desktop. This includes the ability to read Unicode Sinhala websites in Firefox, read and write Unicode Sinhala documents in Open Office and send Unicode Sinhala emails in Evolution.

2.1. Learn about Sinhala

http://en.wikipedia.org/wiki/Sinhala
http://www.speaksinhala.com
Silva, A.W.L. (2003). SINHALESE for Beginners. Sri Lanka: Pubudu Printers Kandana
- The first chapter has excellent coverage of the Sinhala letters.

2.2. Learn about Unicode

2.3. Learn about Standards

2.4. Mailing Lists

The Sinhala GNU/Linux users and developers use these mailing lists for announcements, discussions, debugging and reviews.

Therefore, the archives of the mailing lists contain very useful information for both new users and new developers.

Please search the archives for the answers to your questions before posting to the mailing lists.

2.4.1. Sinhala Technical List (in English for developers)

2.4.2. Sinhala Unicode List (in Sinhala for developers and users)

http://groups.google.com/group/sinhala-unicode

2.5. Contributors

Many individuals have contributed to the project that began in 2003 on an lug.lk mailing list. Some of the notable contributors are:

Dushara Jayasinghe
Harshula Jayasuriya
Chamath Keppitiyagama
Danishka Navin
Anuradha Ratnaweera
Harsha Senanayake
Naoto Takahashi
Daiki Ueno
Steven White

3. Enabling Sinhala

3.1. Short Answer

3.1.1. Debian 6.0 (Squeeze) and Above (may work on older versions)

As root/superuser run:

apt-get install ttf-sinhala-lklug ibus im-switch ibus-m17n m17n-db m17n-contrib

From your user account (i.e. not root) run:

rm -f ~/.xinput.d/* ; im-switch -z all_ALL -s ibus

Log out and log in again so that the environment variables are set/updated (there is no need to reboot).
From your user account (i.e. not root) select your keyboard layouts by running:
ibus-setup

In Debian Wheezy IBus may depend on im-config. im-config conflicts with im-switch, so you will need to run im-config instead of im-switch.

3.1.2. Fedora 10 (Cambridge) and Above (may work on older versions)

As root/superuser run:
yum groupinstall sinhala-support
From your user account (i.e. not root), select and configure the Input Method by running:
im-chooser

3.1.3. Ubuntu 9.10 (Karmic) and Above (may work on older versions)

The Universe repository should already be enabled (https://wiki.ubuntu.com/AlwaysEnableUniverseMultiverse). If not, first enable the Universe repository.

As root/superuser run:

apt-get install ttf-sinhala-lklug ibus im-switch ibus-m17n m17n-db m17n-contrib language-pack-si-base

From your user account (i.e. not root) run:

rm -f ~/.xinput.d/* ; im-switch -z all_ALL -s ibus

Log out and log in again so that the environment variables are set/updated (there is no need to reboot).
From your user account (i.e. not root) select your keyboard layouts by running:
ibus-setup

3.1.4. How to test

Visit http://si.wikipedia.org/ and see if the Sinhala letters render correctly.
Copy and paste some of the content from Sinhala Wikipedia into Open Office Writer. Then highlight the Sinhala text and choose the LKLUG font to display it.
To test typing, press Control-space whilst you are running a GNOME application. Then select one of the Sinhala input methods.

3.2. Long Answer

The instructions in Section 3.1 should be sufficient. If you are using an older version of the distribution or a different distribution, then you may need to read the following section.

3.2.1. Fonts

If your distribution does not contain a Sinhala font package, then download a Unicode Sinhala font:

http://sinhala.sourceforge.net/files/lklug.ttf

If you are using a modern GNU/Linux version and it has fontconfig installed, all you have to do is make a .fonts directory in your home directory:

mkdir -p ~/.fonts

and copy the True/Open Type font into that directory.

If you want to make the font available to all users of the system, become root and copy the font to:

/usr/share/fonts

In both the above cases, run:

fc-cache -fv

To check which font file provides the Sinhala support, run:

fc-list :lang=si file

Immediately you'll be able to read Unicode Sinhala in these programs (You may have to restart the program.):

Anything gtk2/gtk3 based
- evolution
- gedit
- gucharmap
- Firefox/Mozilla (built with gtk2, FreeType2 and Pango support)

If you have Pango 1.8.2 or greater, you will have full SLS1134 Sinhala support. Harfbuzz does not support Sinhala or other Indic scripts well at the moment (29/08/2011).

3.2.2. Input Methods

There are a number of different input method infrastructures available on GNU/Linux. In general, first try IBus/m17n and as a last resort XKB.

To test multi-lingual input methods in gtk2/gtk3 based programs, run:

gedit

To check which input method systems are available, run:

im-switch -l

To switch between input method systems, run:

im-switch -z all_ALL -c

Generally choose:

ibus

3.2.2.1. Surrounding Text Support (STS)

Surrounding text support allows the creation of user-friendly input methods. Input methods without surrounding text support are difficult to use when editing existing text. Some applications do not fully support surrounding text. In these circumstances you may want to use an input method that can fall back to pre-edit.

m17n-db 1.5.5 and older contain a Wijesekera (si-wijesekera) layout that requires surrounding text support; however, there is also a pre-edit version (si-wijesekera-preedit) that does not. m17n-db 1.6.0 and newer merge the pre-edit and surrounding text support into one Wijesekera (si-wijesekera) version. It defaults to pre-edit, but surrounding text support can be enabled via ibus-setup-m17n. Once enabled, it will automatically detect whether the application supports surrounding text. If not, it will fall back to pre-edit.

3.2.3. Keyboard Layouts

If you are unfamiliar with the Wijesekera keyboard layout, here are some recommended keyboard layouts. If you download any of the m17n keyboard layout files, copy it to the correct location and log in again in order to use it.

3.2.3.1. m17n Transliteration Keyboard Layout

To familiarise yourself with this keyboard layout, read:

http://www.nongnu.org/sinhala/doc/transliteration/sinhala-transliteration_5.html

The aforementioned layout is already included in distributions that contain the m17n-contrib package. It does not require surrounding text support.

The file can be found at:

/usr/share/m17n/si-trans.mim

The latest version of the keyboard layout can be downloaded from the source repository:

si-trans.mim

3.2.3.2. m17n Phonetic Dynamic Keyboard Layout

To familiarise yourself with this keyboard layout, read:

http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_4.html

The aforementioned layout is already included in distributions that contain the m17n-contrib package. It requires surrounding text support.

The file can be found at:

/usr/share/m17n/si-phonetic-dynamic.mim

The latest version of the keyboard layout can be downloaded from the source repository:

si-phonetic-dynamic.mim

3.2.3.3. XKB Phonetic Static Keyboard Layout

To familiarise yourself with this keyboard layout, read:

http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html

The X Keyboard Extension only allows one-to-one mappings between keys and codepoints, therefore rakaaranshaya, yansaya and repaya, which consist of multiple codepoints, have to be manually constructed. See the comments in the Sinhala X Keyboard Extension layout file.
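
To illustrate why a one-to-one layout cannot emit these forms with a single key, here is a small Python sketch that spells out the codepoint sequences involved. It assumes the usual Unicode Sinhala encoding of these forms (al-lakuna U+0DCA combined with ZWJ U+200D); the constant and function names are illustrative only and do not come from the lk layout file.

```python
AL_LAKUNA = "\u0DCA"  # SINHALA SIGN AL-LAKUNA
ZWJ = "\u200D"        # ZERO WIDTH JOINER
RA = "\u0DBB"         # SINHALA LETTER RAYANNA
YA = "\u0DBA"         # SINHALA LETTER YAYANNA

# Each form is a sequence of three codepoints, so a key that emits exactly
# one codepoint cannot produce it in a single press.
RAKAARANSHAYA = AL_LAKUNA + ZWJ + RA  # typed after a consonant
YANSAYA = AL_LAKUNA + ZWJ + YA        # typed after a consonant
REPAYA = RA + AL_LAKUNA + ZWJ         # typed before a consonant


def codepoints(text):
    """Return a readable 'U+XXXX U+XXXX ...' listing of the codepoints in text."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    for name, seq in (("rakaaranshaya", RAKAARANSHAYA),
                      ("yansaya", YANSAYA),
                      ("repaya", REPAYA)):
        print(f"{name}: {codepoints(seq)}")
```

With the XKB layout each of these codepoints has to be typed separately, in the order listed.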

The aforementioned layout is already included in distributions that ship with xkeyboard-config (xkb-data) 0.6 and above.

The file can be found at:

/usr/share/X11/xkb/symbols/lk

The latest version of the keyboard layout can be downloaded from the source repository:

Read the comments in the lk file to see how to create rakaaranshaya, yansaya and repaya.

The window manager should come with a program which allows the user to choose multiple keyboard layouts.

In the example below I have chosen the SHIFT keys to switch between the Sinhala phonetic layout and the US QWERTY layout. Hold one of the SHIFT keys down and then press the other SHIFT key; this should toggle between the layouts.

Using the GUI in GNOME:
1. Run:
  gnome-keyboard-properties
2. Choose the “Layouts” tab and click on the “Add” button. This will open a new window which contains a list of layouts ordered by country.
3. Scroll down the list till you find “Sri Lanka” and then highlight it by clicking on it. The Sinhala layout is the Default in the Sri Lanka layouts file. Then press “OK”.
4. Choose the “Layout Options” tab and click on the text “Layout switching”. A list will expand below this text.
5. Scroll down the list till you find the text “Both Shift keys together change layout”. Click on the corresponding checkbox.
6. If you wish to use an LED to indicate the toggling of keyboard layouts, click on the text “Use keyboard LED to show alternative layout”. A list will expand below this text.
7. Scroll down the list till you find the text “ScrollLock LED shows alternative layout”. Click on the corresponding checkbox.

Using the command line in X:

In an xterm, run:

setxkbmap -layout "us,lk" -option "grp:shifts_toggle,grp_led:scroll"

Alternatively, you can directly modify /etc/X11/xorg.conf:
1. To add the new lk keyboard layout, look for this line:
  Section "InputDevice"
  There will probably be two such lines, one for the keyboard and another for the mouse. Go to the keyboard related line.
2. Then add 'lk' to a line that looks like:
  Option "XkbLayout" "us,lk"
3. Also add a mechanism to switch between 'us' and 'lk' and indicate which LED should be used:
  Option "XkbOptions" "grp:shifts_toggle,grp_led:scroll"
4. If asked by the window manager, reset keyboard defaults to the X defaults.

3.2.4. Character Maps

You can use a Unicode Character Map program to copy and paste Sinhala characters into your program/document. Available programs are:

gucharmap

3.2.5. Databases

The ability to alphabetically sort words in a database is essential. Till recently databases containing Sinhala words could not be sorted according to the Sinhala sorting order as established by SLS1134 - Part 1: Collation Sequence. Sinhala words can be sorted in MySQL (from version 5.2).

Instead of running this query:

SELECT * FROM table1 ORDER BY column1;

run a slightly modified query that looks like:

SELECT * FROM table1 ORDER BY column1 COLLATE utf8_sinhala_ci;

3.2.6. Locales

The Sinhala locale for Sri Lanka is already included in most distributions. If you wish to view the user interface in Sinhala, then you must first enable the si_LK locale.

3.2.6.1. Debian and Ubuntu

The supported locales are listed in the file:

/usr/share/i18n/SUPPORTED

The locale definition is available in:

/usr/share/i18n/locales/

These files are provided by the locales package.

The locale definitions need to be compiled before they can be used. To compile a particular locale, first edit:

/etc/locale.gen

This will involve uncommenting the required locale, e.g.:

# sid_ET UTF-8
si_LK UTF-8
# sk_SK ISO-8859-2

Then run the program:

/usr/sbin/locale-gen

Following that step you can then see what a particular program looks like when it is localised:

LANG=si_LK.UTF-8 gedit
LANG=fr_FR.UTF-8 gedit

The actual translated message strings are stored in:

/usr/share/locale/<ISO639 2 character lang code>/LC_MESSAGES/

Each program adds a file containing translated message strings.

The default locale is set in the file:

/etc/default/locale

3.2.6.2. Collation

To sort Sinhala words, run:

LC_ALL=si_LK.utf8 sort
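
The same locale-based collation can be reached from a program. The following Python sketch is one way to do it, assuming the si_LK locale has already been generated as described in this section; the function name is hypothetical, and the result depends entirely on the collation data shipped with the system's C library.

```python
import locale


def sort_with_locale(words, locale_name="si_LK.UTF-8"):
    """Return words sorted by the LC_COLLATE rules of locale_name.

    Raises locale.Error if the locale has not been generated on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    finally:
        # Always restore the previous collation locale.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    try:
        for word in sort_with_locale(["\u0DB8\u0DBD", "\u0D9A\u0DBD", "\u0D85\u0DB8"]):
            print(word)
    except locale.Error:
        print("si_LK.UTF-8 is not available; generate it with locale-gen first")
```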

3.2.7. Open Office

By default Open Office chooses the first Sinhala font regardless of its completeness. Therefore, you may have to highlight the Sinhala text and select a complete Unicode Sinhala font such as the LKLUG font.

3.2.8. LaTeX

TeX Live, 2009 release, should contain a version of XeTeX with Unicode Sinhala support.

A less preferable option is to download sintex:

http://www.ucsc.cmb.ac.lk/People/cik/sintex-0.2.0.tar.gz

4. Developer Notes

Savannah Sinhala Project: http://savannah.nongnu.org/projects/sinhala/
Sourceforge Sinhala Project: http://sourceforge.net/projects/sinhala/

4.1. Open Type Fonts

You can use these PHP and Python scripts to generate Unicode Sinhala letters:

4.1.1. List of feature tags

http://partners.adobe.com/public/developer/opentype/index_tag3.html

4.1.2. Glyph Naming

4.1.3. Indic

4.1.4. How GNU FreeFont obtained Sinhala glyphs

4.1.5. Development Info

If you wish to use TeX font glyphs to create a new font, mftrace will be useful. In most circumstances fontforge alone will be sufficient.

4.2. Renderer (Layout Engine)

The top of tree Pango (since 1.8.2) & ICU (since 3.6) now support SLS1134.

4.2.1. Pango

Pango's Indic renderer is based on ICU's Indic renderer.

The original patch to add Sinhala support was created by Harsha Senanayake for ICU [1] and later ported to Pango. The Pango patch was ported to the latest version of Pango by Chamath Keppitiyagama. It was submitted to bugzilla by Anuradha Ratnaweera [2]. Harshula Jayasuriya modified the Pango state table & ZWJ handling [3] & [4].

The Pango code for Sinhala and Indic rendering is common and can be found in the Pango source at:

modules/indic/

One of the most important files to understand is:

modules/indic/indic-ot-class-tables.c

Particularly how the function:

indic_ot_find_syllable()

works.

Next have a look at the file:

modules/indic/indic-ot.c

and the function:

indic_ot_reorder()

4.2.2. ICU

Owen Taylor (Pango) submitted the Pango Sinhala patch to the ICU project [5]. Eric Mader (ICU) ported the Pango patch to ICU and checked-in the changes to ICU 3.6. Then Eric added the state table & ZWJ modifications from Pango to ICU 3.6 [6] & [7].

4.2.2.1. Split dependent vowel modifier (diga o) issue

There was an issue with U+0DDD (dependent vowel diga o) that could cause Open Office to crash. Opening this text file will crash Open Office and ICU 3.6:

icu-crash-testcase-dv-oo.txt

The worstCaseExpansion for Sinhala was set to 3 when it should have been set to 4. The dependent vowel 'oo' (U+0DDD) consists of (kombuva)(dotted-circle)(aela-pilla)(al-lakuna) which are 4 glyphs. As a result of the worstCaseExpansion being 3, memory was probably being allocated for 3 glyphs when memory was required for 4 glyphs. The actual crash occurred when unallocated memory was being freed.
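
The count of four can be checked with a short Python sketch. It uses Unicode canonical decomposition (NFD) as a stand-in for the way the layout engine splits the vowel sign, which is an approximation rather than ICU's actual code path; the dotted circle (U+25CC) is added by hand to model the placeholder base shown when the vowel sign has no consonant. The function names are illustrative.

```python
import unicodedata

DIGA_O = "\u0DDD"         # dependent vowel 'oo'
DOTTED_CIRCLE = "\u25CC"  # placeholder base for a vowel sign with no consonant


def decompose(ch):
    """Canonical decomposition of a single character into its parts."""
    return unicodedata.normalize("NFD", ch)


def standalone_sequence(ch):
    """Parts of a split vowel sign with a dotted circle after the first part."""
    parts = decompose(ch)
    return [parts[0], DOTTED_CIRCLE, *parts[1:]]


if __name__ == "__main__":
    seq = standalone_sequence(DIGA_O)
    print(" ".join(f"U+{ord(c):04X}" for c in seq))
    print(len(seq), "items")
```

U+0DDD decomposes into kombuva (U+0DD9), aela-pilla (U+0DCF) and al-lakuna (U+0DCA); with the dotted circle that is four items, one more than a worstCaseExpansion of 3 allows for.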

Caolan McNamara also found this bug and fixed it first. [8]

4.2.2.2. Call Tree

source/layoutex/ParagraphLayout.cpp
ParagraphLayout::ParagraphLayout(const LEUnicode chars[], le_int32 count, const FontRuns *fontRuns, const ValueRuns *levelRuns, const ValueRuns *scriptRuns, const LocaleRuns *localeRuns, UBiDiLevel paragraphLevel, le_bool vertical, LEErrorCode &status)
1. source/layout/LayoutEngine.cpp:
  LayoutEngine *LayoutEngine::layoutEngineFactory(const LEFontInstance *fontInstance, le_int32 scriptCode, le_int32 languageCode, LEErrorCode &success)
2. LayoutEngine *LayoutEngine::layoutEngineFactory(const LEFontInstance *fontInstance, le_int32 scriptCode, le_int32 languageCode, le_int32 typoFlags, LEErrorCode &success)
  - IndicOpenTypeLayoutEngine::IndicOpenTypeLayoutEngine(const LEFontInstance *fontInstance, le_int32 scriptCode, le_int32 languageCode, le_int32 typoFlags, const GlyphSubstitutionTableHeader *gsubTable)
  - IndicOpenTypeLayoutEngine::IndicOpenTypeLayoutEngine(const LEFontInstance *fontInstance, le_int32 scriptCode, le_int32 languageCode, le_int32 typoFlags)
source/layout/LayoutEngine.cpp
le_int32 LayoutEngine::layoutChars(const LEUnicode chars[], le_int32 offset, le_int32 count, le_int32 max, le_bool rightToLeft, float x, float y, LEErrorCode &success)
1. le_int32 LayoutEngine::computeGlyphs(const LEUnicode chars[], le_int32 offset, le_int32 count, le_int32 max, le_bool rightToLeft, LEGlyphStorage &glyphStorage, LEErrorCode &success)
  1. source/layout/IndicLayoutEngine.cpp
    le_int32 IndicOpenTypeLayoutEngine::characterProcessing(const LEUnicode chars[], le_int32 offset, le_int32 count, le_int32 max, le_bool rightToLeft, LEUnicode *&outChars, LEGlyphStorage &glyphStorage, LEErrorCode &success)
    1. source/layout/IndicReordering.cpp
      le_int32 IndicReordering::reorder(const LEUnicode *chars, le_int32 charCount, le_int32 scriptCode, LEUnicode *outChars, LEGlyphStorage &glyphStorage, MPreFixups **outMPreFixups)
engine->getGlyphs(fStyleRunInfo[run].glyphs, layoutStatus);
engine->getGlyphPositions(fStyleRunInfo[run].positions, layoutStatus);
engine->getCharIndices(&fGlyphToCharMap[glyphBase], runStart, layoutStatus);

4.2.3. Harfbuzz

Harfbuzz does not support Sinhala or other Indic scripts well at the moment (29/08/2011). The Indic module (hb-ot-shape-complex-indic.cc) needs a lot more work to be useful.

4.3. Firefox/Mozilla

Interestingly, Debian, Fedora Core and Ubuntu decided to address enabling Pango in Firefox in completely different ways.

4.3.1. Debian

Since Debian 4.0 (Etch), Pango is enabled by default.

4.3.2. Ubuntu

4.3.2.1. Ubuntu 5.10

Ubuntu 5.10 enabled Pango by default. Have a look at:

/usr/bin/mozilla-firefox

which contains the code:

##
## Set MOZ_ENABLE_PANGO
##
MOZ_ENABLE_PANGO=1
export MOZ_ENABLE_PANGO

4.3.2.2. Ubuntu 6.06

Ubuntu 6.06, on the other hand, disabled Pango in Firefox by default except for a pre-determined list of locales. The extensive discussion can be found here:

https://launchpad.net/distros/ubuntu/+source/firefox/+bug/32561

Have a look at:

/usr/bin/mozilla-firefox

which contains the code:

if [ "x${MOZ_DISABLE_PANGO}" = x ]; then
    if egrep '^(bn|gu|hi|kn|ml|mr|ne|pa|ta|te)_' \
        /var/lib/locales/supported.d/*[^~] >/dev/null 2>&1; then
        MOZ_DISABLE_PANGO=0
    else
        MOZ_DISABLE_PANGO=1
    fi
    export MOZ_DISABLE_PANGO
fi
if [ "x${MOZ_DISABLE_PANGO}" = x0 ]; then
    unset MOZ_DISABLE_PANGO
fi

This means that Ubuntu 6.06 users that need Pango enabled in Firefox need to set an environment variable:

MOZ_DISABLE_PANGO=0

You can see the difference by running Firefox at the command line like so:

# MOZ_DISABLE_PANGO=0 mozilla-firefox
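
The decision made by the Ubuntu 6.06 wrapper script quoted in this subsection can be modelled in a few lines of Python. This is only a sketch of the quoted shell logic: the function name is hypothetical, and the list of supported locales is passed in as strings rather than read from /var/lib/locales/supported.d/.

```python
import re

# Same locale prefixes as the egrep pattern in the wrapper script.
PANGO_LOCALES = re.compile(r"^(bn|gu|hi|kn|ml|mr|ne|pa|ta|te)_")


def effective_moz_disable_pango(env_value, supported_locales):
    """Return the value Firefox sees for MOZ_DISABLE_PANGO, or None if unset.

    env_value is the value set by the user (None or "" when not set);
    supported_locales are lines such as "si_LK UTF-8".
    """
    if not env_value:
        # No user setting: keep Pango enabled only for the listed locales.
        listed = any(PANGO_LOCALES.match(line) for line in supported_locales)
        env_value = "0" if listed else "1"
    # The script unsets the variable when it is "0", leaving Pango enabled.
    return None if env_value == "0" else env_value


if __name__ == "__main__":
    print(effective_moz_disable_pango(None, ["si_LK UTF-8"]))
    print(effective_moz_disable_pango("0", ["si_LK UTF-8"]))
```

Because si is not in the pattern, a system with only the si_LK locale ends up with Pango disabled unless the user sets MOZ_DISABLE_PANGO=0.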

4.3.3. Ubuntu 6.10 & Above

Ubuntu 6.10 users can enable Pango in Firefox by setting an environment variable:

MOZ_DISABLE_PANGO=0

Or by simply installing the Ubuntu package:

language-pack-si-base

4.3.4. Fedora Core

Since Fedora Core 4, Pango is enabled in Firefox by default. In order to disable Pango in Firefox an environment variable has to be set:

MOZ_DISABLE_PANGO=1

You can see the difference by running Firefox at the command line like so:

# MOZ_DISABLE_PANGO=1 firefox

Have a look at:

/usr/bin/firefox

for an explanation.

4.3.4.1. Fedora Core 3

Firefox and Mozilla can be enabled with pango rendering support, which enables many text layout features, including the rendering of CTL (Complex Text Layout) such as Indic languages. To enable this, set the following environment variable when running Firefox or Mozilla:
MOZ_ENABLE_PANGO=1 [9]

4.3.4.2. Fedora Core 4 [10]

##
## Set MOZ_ENABLE_PANGO is no longer used because Pango is enabled by default
## you may use MOZ_DISABLE_PANGO=1 to force disabling of pango
##
#MOZ_DISABLE_PANGO=1
#export MOZ_DISABLE_PANGO

4.3.4.3. Fedora Core 5 [11]

##
## In order to better support certain scripts (such as Indic and some CJK 
## scripts), Fedora builds its Firefox, with permission from the Mozilla 
## Corporation, with the Pango system as its text renderer.  This change 
## is known to break rendering of MathML, and may negatively impact 
## performance on some pages.  To disable the use of Pango, set 
## MOZ_DISABLE_PANGO=1 in your environment before launching Firefox.
##
#
# MOZ_DISABLE_PANGO=1
# export MOZ_DISABLE_PANGO
#

4.3.5. Epiphany Browser

Changelog:

2006-01-27  Christian Persch  <chpe at cvs dot gnome dot org>
        * src/ephy-main.c: (main):
        Disable pango rendering by default, unless MOZ_ENABLE_PANGO env
        var is set. Bug #328844.

src/ephy-main.c:

        /* Work around bug #328844, and avoid the gecko+pango performance problem */
        env = g_getenv ("MOZ_ENABLE_PANGO");
        enable_pango = env != NULL &&
                       env[0] != '\0' &&
                       g_ascii_strtoull (env, NULL, 10) != 0;
        if (eel_gconf_get_boolean (CONF_GECKO_ENABLE_PANGO))
        {
                g_print ("NOTE: Enabling gecko pango renderer; this may cause performance degradation.\n"
                         "You can set " CONF_GECKO_ENABLE_PANGO " to \"false\" to disable it.\n");
        }
        else if (!enable_pango)
        {
                g_setenv ("MOZ_DISABLE_PANGO", "1", TRUE);
        }

Epiphany also has a file, data/epiphany-pango.schemas, containing a list of locales which require Pango to be enabled by default.

4.4. Open Office

http://wiki.services.openoffice.org/wiki/Debugging

4.4.1. Open Office 2.0.4

Whilst working on the patches for adding Sinhala support to ICU, the renderer used by Open Office, I observed that the ZWJ characters did not appear to reach ICU [12]. I demonstrated this at the Red Hat Sri Lanka office (approx. 21/04/2006). Red Hat Sri Lanka then conveyed this to Red Hat India on 27/07/2006. A few days later Red Hat India opened a bug [13]. Then, Caolan McNamara found the Open Office file that filters ZWJ and ZWNJ [14].

The source file:

vcl/source/gdi/sallayout.cxx

contains a function:

inline bool IsControlChar( sal_Unicode cChar )

This function tells a caller that characters U+200B to U+200F are control characters.
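
The effect on Sinhala text can be shown with a small Python model. It covers only the range named above (U+200B to U+200F) and is not a transcription of the Open Office function; the helper names are hypothetical.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def is_control_char(ch):
    """Model of the range described above: U+200B to U+200F inclusive."""
    return 0x200B <= ord(ch) <= 0x200F


def strip_control_chars(text):
    """Drop every character that is_control_char() flags."""
    return "".join(ch for ch in text if not is_control_char(ch))


# "sri": sha, al-lakuna, ZWJ, ra, diga is-pilla
SRI = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"

if __name__ == "__main__":
    filtered = strip_control_chars(SRI)
    print(" ".join(f"U+{ord(c):04X}" for c in SRI))
    print(" ".join(f"U+{ord(c):04X}" for c in filtered))
```

Both ZWJ and ZWNJ fall inside the range, so text filtered this way reaches the renderer without the joiner that requests the conjunct form.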

In the source file:

linguistic/source/misc.cxx

two functions,

static INT16 GetOrigWordPos( const OUString &rOrigWord, INT16 nPos )

and

INT32 GetPosInWordToCheck( const OUString &rTxt, INT32 nPos )

call

inline bool IsControlChar( sal_Unicode cChar )

when doing linguistic analysis for what appears to be spelling purposes. I even found some comments written in, I assume, German.

In the source file:

vcl/source/gdi/sallayout.cxx

there is a function:

void ImplLayoutArgs::AddRun( int nCharPos0, int nCharPos1, bool bRTL )

which calls the function:

inline bool IsControlChar( sal_Unicode cChar )

Its purpose is to:

// add a run after splitting it up to get rid of control chars

It should be noted that this function handles RTL text in a different way to LTR text. My initial reaction is that this should not be the case. However, I have not looked into it any further.

Compiling Open Office 2.0.4 on Debian Etch on a Pentium M 2.13 GHz with 1 GiB RAM took approximately 10 hours and required 10 GB of additional hard drive space for the source and the compiled files.

4.4.2. Open Office 2.1

Open Office 2.1 does not filter ZWJ; therefore, it supports Unicode Sinhala.

4.5. Input Methods

The recommended infrastructures are XKB, for simple one-to-one keyboard layouts, and IBus[15]/m17n[16] for complex keyboard layouts. XKB is a component of Xorg.

4.5.1. Syllable Segmentation [17]

Sinhala letters which define the start of a new 'syllable':

All independent vowels (U+0d85 - U+0d96)
Kombuva (U+0dd9) - except if preceded by a kombuva.
All consonants (U+0d9a - U+0dc6) - except if preceded by kombuva or kombuva deka (U+0ddb)
Kunddaliya (U+0df4)
All non-Sinhala characters/codepoints - except ZWJ (U+200D)
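
The rules above translate fairly directly into code. The following Python sketch applies them literally, one codepoint at a time; treating "Sinhala" as the Unicode block U+0D80 to U+0DFF is an assumption, and the function names are illustrative. The kombuva exceptions appear to cater for typing order, where the kombuva is entered before its consonant.

```python
KOMBUVA = 0x0DD9
KOMBU_DEKA = 0x0DDB
KUNDDALIYA = 0x0DF4
ZWJ = 0x200D


def is_sinhala(cp):
    """Assumption: 'Sinhala' means the Unicode block U+0D80..U+0DFF."""
    return 0x0D80 <= cp <= 0x0DFF


def starts_syllable(cp, prev):
    """Apply the rules above; prev is the preceding codepoint or None."""
    if 0x0D85 <= cp <= 0x0D96:          # independent vowels
        return True
    if cp == KOMBUVA:                   # kombuva, unless it follows a kombuva
        return prev != KOMBUVA
    if 0x0D9A <= cp <= 0x0DC6:          # consonants, unless after kombuva(s)
        return prev not in (KOMBUVA, KOMBU_DEKA)
    if cp == KUNDDALIYA:
        return True
    if not is_sinhala(cp):              # anything non-Sinhala except ZWJ
        return cp != ZWJ
    return False                        # other Sinhala signs continue a syllable


def segment(text):
    """Split text into a list of syllable strings."""
    syllables = []
    prev = None
    for ch in text:
        cp = ord(ch)
        if not syllables or starts_syllable(cp, prev):
            syllables.append(ch)
        else:
            syllables[-1] += ch
        prev = cp
    return syllables


if __name__ == "__main__":
    for part in segment("\u0DC3\u0DD2\u0D82\u0DC4\u0DBD"):
        print(part)
```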

4.5.2. Keyboard

You can use showkey in Linux to display the keycode of a key:

57 - space
56 - left alt
100 - right alt
29 - left ctrl
97 - right ctrl
42 - left shift
54 - right shift

Look in the Linux source:

drivers/char/keyboard.c

Look for the function:

getkeycode()

4.5.3. XKB - adding a new keyboard layout

All you need to do is copy the keyboard layout file into the correct directory for your system, one of:

/etc/X11/xkb/symbols/

/etc/X11/xkb/symbols/pc/

/usr/share/X11/xkb/symbols

However, for completeness some files in these directories:

/etc/X11/

/usr/X11R6/lib/X11/

/usr/share/X11/

need to be modified, namely these files:

xkb/rules/{xorg,xfree86}
xkb/rules/{xorg,xfree86}.lst
xkb/rules/{xorg,xfree86}.xml
xkb/symbols.dir

To test a loaded keyboard layout:

setxkbmap -print | xkbcomp -w 10 -xkb - <outfile>

4.5.4. IBus

IBus can be used as both the frontend, which is exposed to the user, and the backend that maps keycodes to codepoints. Alternatively, IBus can be used as a frontend for other backends, e.g. m17n can be a backend via the ibus-m17n engine.

ibus 1.3.99.20110817 and ibus-m17n 1.3.3 support Surrounding Text Support (STS).

4.5.5. m17n

The m17n backend keyboard layout definition file is a text file. The documentation can be found at:

http://www.m17n.org/common/m17n-docs-en/m17nDBFormat.html#mdbIM

4.5.6. xmodmap

The xmodmap keyboard layout is not fully functional, hence it is recommended you use the X Keyboard Extension keyboard layout. To familiarise yourself with this keyboard layout, read:

http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html

Download the keyboard layout from:
- sin.xmodmap
Then run xmodmap:
xmodmap sin.xmodmap

4.5.7. gvim

To familiarise yourself with this keyboard layout, read:

http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html

Download the keyboard layout and redirector from:
- sinhala-phonetic_utf-8.vim
- sinhala.vim
Copy the keyboard layout and redirector to ~/.vim/keymap/
Start gvim
Disable the menu so that you can use the 'alt' key:
set guioptions-=m
Select the new keyboard layout, using the redirector, by typing:
set keymap=sinhala
or select the new keyboard layout directly by typing:
set keymap=sinhala-phonetic_utf-8
To toggle between the Sinhala keyboard layout and the standard ASCII keyboard layout, press <Ctrl> <6> whilst in insert mode.

4.6. Databases

4.6.1. Collation

The Sinhala letters in the Unicode chart can be categorised as:

independent vowels: U+0D85 - U+0D96
consonants: U+0D9A - U+0DC6
dependent vowels: U+0DCA - U+0DF3
consonant modifiers: U+0D82 - U+0D83

The collation order of the groups can be broadly described as:

independent vowels
consonant modifiers
consonants
dependent vowels.

The codepoint order of the groups, and of the letters within the groups, does not correspond to the collation order. Hence, tailoring is required.

4.6.2. Tailoring

4.6.2.1. Sinhala tailoring rules

The first tailoring rule is the minimal description of Sinhala tailoring. The second tailoring rule is the complete description of Sinhala tailoring as required by MySQL[18].

/*
  SCCII Part 1 : Collation Sequence (SLS1134)
  2006/11/24
  Harshula Jayasuriya <harshula at gmail dot com>
  Language Technology Research Lab, University of Colombo / ICTA
*/
#if 1
static const char sinhala[]=
    "& \\u0D96 < \\u0D82 < \\u0D83"
    "& \\u0DA5 < \\u0DA4"
    "& \\u0DD8 < \\u0DF2 < \\u0DDF < \\u0DF3"
    "& \\u0DDE < \\u0DCA";
#else
static const char sinhala[]=
    "& \\u0D96 < \\u0D82 < \\u0D83 < \\u0D9A < \\u0D9B < \\u0D9C < \\u0D9D"
              "< \\u0D9E < \\u0D9F < \\u0DA0 < \\u0DA1 < \\u0DA2 < \\u0DA3"
              "< \\u0DA5 < \\u0DA4 < \\u0DA6"
              "< \\u0DA7 < \\u0DA8 < \\u0DA9 < \\u0DAA < \\u0DAB < \\u0DAC"
              "< \\u0DAD < \\u0DAE < \\u0DAF < \\u0DB0 < \\u0DB1"
              "< \\u0DB3 < \\u0DB4 < \\u0DB5 < \\u0DB6 < \\u0DB7 < \\u0DB8"
              "< \\u0DB9 < \\u0DBA < \\u0DBB < \\u0DBD < \\u0DC0 < \\u0DC1"
              "< \\u0DC2 < \\u0DC3 < \\u0DC4 < \\u0DC5 < \\u0DC6"
              "< \\u0DCF"
              "< \\u0DD0 < \\u0DD1 < \\u0DD2 < \\u0DD3 < \\u0DD4 < \\u0DD6"
              "< \\u0DD8 < \\u0DF2 < \\u0DDF < \\u0DF3 < \\u0DD9 < \\u0DDA"
              "< \\u0DDB < \\u0DDC < \\u0DDD < \\u0DDE < \\u0DCA";
#endif
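
To see the effect of the tailoring without a database, the complete rule can be turned into a simple lookup table. The Python sketch below is a single-level approximation, not the UCA algorithm MySQL implements: it ranks each codepoint by its position in the complete rule, ignores ZWJ and ZWNJ, and places anything else after the Sinhala letters. The names are illustrative.

```python
# Independent vowels U+0D85..U+0D96 keep their codepoint order; the rest is
# transcribed from the complete tailoring rule above, which resets at U+0D96.
_ORDER = [chr(cp) for cp in range(0x0D85, 0x0D97)] + [chr(cp) for cp in (
    0x0D82, 0x0D83, 0x0D9A, 0x0D9B, 0x0D9C, 0x0D9D,
    0x0D9E, 0x0D9F, 0x0DA0, 0x0DA1, 0x0DA2, 0x0DA3,
    0x0DA5, 0x0DA4, 0x0DA6,
    0x0DA7, 0x0DA8, 0x0DA9, 0x0DAA, 0x0DAB, 0x0DAC,
    0x0DAD, 0x0DAE, 0x0DAF, 0x0DB0, 0x0DB1,
    0x0DB3, 0x0DB4, 0x0DB5, 0x0DB6, 0x0DB7, 0x0DB8,
    0x0DB9, 0x0DBA, 0x0DBB, 0x0DBD, 0x0DC0, 0x0DC1,
    0x0DC2, 0x0DC3, 0x0DC4, 0x0DC5, 0x0DC6,
    0x0DCF,
    0x0DD0, 0x0DD1, 0x0DD2, 0x0DD3, 0x0DD4, 0x0DD6,
    0x0DD8, 0x0DF2, 0x0DDF, 0x0DF3, 0x0DD9, 0x0DDA,
    0x0DDB, 0x0DDC, 0x0DDD, 0x0DDE, 0x0DCA,
)]
_RANK = {ch: i for i, ch in enumerate(_ORDER)}
_IGNORED = {"\u200C", "\u200D"}  # simplification: joiners carry no weight


def sinhala_key(word):
    """Sort key: one (rank, tiebreak) pair per character of the word."""
    return tuple(
        (_RANK[ch], 0) if ch in _RANK else (len(_ORDER), ord(ch))
        for ch in word if ch not in _IGNORED
    )


def sinhala_sorted(words):
    """Return words sorted by the approximate tailored order."""
    return sorted(words, key=sinhala_key)


if __name__ == "__main__":
    for word in sinhala_sorted(["\u0DA4", "\u0DA5", "\u0D9A", "\u0D82", "\u0D85"]):
        print(word)
```

Plain codepoint sorting puts U+0DA4 before U+0DA5 and U+0D82 before every vowel; the table reverses both, as the rule requires.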

4.6.3. MySQL

4.6.3.1. Terminology

ci = case insensitive
cs = case sensitive
bin = binary

4.6.3.2. Useful Commands

SHOW CHARACTER SET;
SHOW COLLATION;
SHOW COLLATION LIKE 'ucs%';
SHOW COLLATION LIKE 'utf8%';
SET NAMES 'utf8'; -- after connecting to the server, if the server has NOT set 'skip-character-set-client-handshake'
SHOW CREATE TABLE <table-name>;
SHOW VARIABLES;
\s

4.6.3.3. Configure MySQL Server

Edit the file /etc/mysql/my.cnf and add to the [mysqld] section:

default-character_set=utf8
skip-character-set-client-handshake

This is done to ensure that UTF-8 is the default encoding for the server and client.

4.6.3.4. Testing MySQL

Test Swedish (MySQL is a Swedish company) collation algorithm:
http://en.wikipedia.org/wiki/Swedish_alphabet
CREATE TABLE t1 ( id SERIAL PRIMARY KEY, letter VARCHAR(10) NOT NULL ) CHARACTER SET utf8;
Running this collation handler:
SELECT * FROM t1 ORDER BY letter COLLATE utf8_swedish_ci;
results in non-English letters appearing at the end of the sorted alphabet as expected.
Test new Sinhala collation algorithm:
http://en.wikipedia.org/wiki/Sinhala_alphabet
CREATE TABLE t2 ( id SERIAL PRIMARY KEY, letter VARCHAR(10) NOT NULL ) CHARACTER SET utf8;
Load the data from this file:
- mysql-data-sinhala.txt
Running this collation handler:
SELECT * FROM t2 ORDER BY letter COLLATE utf8_sinhala_ci;
results match the SCCII Part 1 : Collation Sequence (SLS1134).
Download the output file from:
- mysql-output-utf8_sinhala_ci.txt

4.6.3.5. Source Code

mysql/strings/ctype-uca.c

/*
  Collation language is implemented according to
  subset of ICU Collation Customization (tailorings):
  http://www.icu-project.org/userguide/Collate_Customization.html
  
  Collation language elements:
  Delimiters:
    space   - skipped
  
  <char> :=  A-Z | a-z | \uXXXX
  
  Shift command:
    <shift>  := &       - reset at this letter. 
  
  Diff command:
    <d1> :=  <     - Identifies a primary difference.
    <d2> :=  <<    - Identifies a secondary difference.
    <d3> := <<<    - Idenfifies a tertiary difference.
  
  
  Collation rules:
    <ruleset> :=  <rule>  { <ruleset> }
    
    <rule> :=   <d1>    <string>
              | <d2>    <string>
              | <d3>    <string>
              | <shift> <char>
    
    <string> := <char> [ <string> ]
  An example, Polish collation:
  
    &A < \u0105 <<< \u0104
    &C < \u0107 <<< \u0106
    &E < \u0119 <<< \u0118
    &L < \u0142 <<< \u0141
    &N < \u0144 <<< \u0143
    &O < \u00F3 <<< \u00D3
    &S < \u015B <<< \u015A
    &Z < \u017A <<< \u017B    
*/
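
The grammar in the comment above is small enough to parse with a regular expression. The following Python sketch reads a tailoring string into a list of (command, string) pairs; it is an independent illustration of the syntax, not a port of my_coll_rule_parse(), and it is slightly more lenient than the grammar in that it also accepts several characters after a shift command.

```python
import re

# <char> := A-Z | a-z | \uXXXX
_CHAR = r"(?:\\u[0-9A-Fa-f]{4}|[A-Za-z])"
# A rule is a command (&, <, << or <<<) followed by one or more characters.
# The longest operators are listed first so "<<<" is not read as "<".
_RULE = re.compile(r"\s*(&|<<<|<<|<)\s*(" + _CHAR + r"+)")
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def parse_tailoring(text):
    """Return [(command, string), ...] for a tailoring rule set."""
    rules = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = _RULE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse tailoring at offset {pos}")
        command, raw = match.groups()
        # Turn each \uXXXX escape into the character it names.
        chars = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)
        rules.append((command, chars))
        pos = match.end()
    return rules


if __name__ == "__main__":
    for command, chars in parse_tailoring(r"& \u0DA5 < \u0DA4"):
        print(command, " ".join(f"U+{ord(c):04X}" for c in chars))
```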

mysql/include/m_ctype.h

typedef struct charset_info_st
{
  uint      number;
  uint      primary_number;
  uint      binary_number;
  uint      state;
  const char *csname;
  const char *name;
  const char *comment;
  const char *tailoring;
  uchar    *ctype;
  uchar    *to_lower;
  uchar    *to_upper;
  uchar    *sort_order;
  uint16   *contractions;
  uint16   **sort_order_big;
  uint16      *tab_to_uni;
  MY_UNI_IDX  *tab_from_uni;
  MY_UNICASE_INFO **caseinfo;
  uchar     *state_map;
  uchar     *ident_map;
  uint      strxfrm_multiply;
  uchar     caseup_multiply;
  uchar     casedn_multiply;
  uint      mbminlen;
  uint      mbmaxlen;
  uint16    min_sort_char;
  uint16    max_sort_char; /* For LIKE optimization */
  uchar     pad_char;
  my_bool   escape_with_backslash_is_dangerous;
  
  MY_CHARSET_HANDLER *cset;
  MY_COLLATION_HANDLER *coll;
  
} CHARSET_INFO;

mysys/charset.c: CHARSET_INFO *all_charsets[256]
mysys/charset.c
init_available_charsets()
- mysys/charset-def.c
  init_compiled_charsets()
  - mysys/charset.c
    add_compiled_collation()
- init_state_maps()
strings/ctype-uca.c
my_coll_init_uca()
- create_tailoring()
  - my_coll_rule_parse()
    - my_coll_lexem_init()
  [Copy the default weights data to the new weights data structure]
mysys/charset.c
get_charset_by_name()
- get_collation_number()
  - get_collation_number_internal()
- get_internal_charset()

4.6.3.6. Files requiring modification

mysql/config/ac-macros/character_sets.m4
mysql/mysys/charset-def.c
mysql/strings/ctype-uca.c
mysql/configure (generated)
mysql/mysql-test/t/ctype_utf8.test
mysql/mysql-test/r/ctype_utf8.result

4.6.3.7. Patch

4.7. Locales

4.7.1. Collation

To define a collation order for a script, edit:

/usr/share/i18n/locales/iso14651_t1_common

To test the changes, run:

LC_ALL=<LOCALE> sort
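
The same check can be scripted. The sketch below is a minimal Python illustration that sorts a list of strings under a named locale's collation rules; sort_with_locale is a hypothetical helper name, and the locale name used in the usage note (si_LK.UTF-8) is an assumption that only works if that locale has been generated on the system.

```python
import locale


def sort_with_locale(words, locale_name):
    """Return words sorted by the LC_COLLATE rules of locale_name.

    Raises locale.Error if the locale is not available. The previous
    collation setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm() maps each string to a key that compares in
        # collation order for the currently selected locale.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

For example, sort_with_locale(["ක", "අ", "ග"], "si_LK.UTF-8") should reflect any changes made to the collation definition once the locale has been regenerated.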

4.8. Translations

4.8.1. GNOME

4.8.2. KDE

4.9. Packaging

4.9.1. Debian

Debian Developers' Corner
Debian New Maintainers' Guide
Debian Developer's Reference
Debian Policy Manual
The Debian GNU/Linux FAQ
Debian Library Packaging guide
Work-Needing and Prospective Packages
To understand issues involving config.guess and config.sub, refer to /usr/share/doc/autotools-dev/README.Debian.gz .
In debian/rules, for the binary-arch target, you will probably have to uncomment dh_install. Alternatively, change the DESTDIR in:
$(MAKE) DESTDIR=$(CURDIR)/debian/tmp install
pbuilder
To extract the content of a .deb file:
ar -x <name>.deb
To create a Debianized source tree from a Debian source package:
dpkg-source -x <name>.dsc
To generate a .deb from within a Debianized source tree:
dpkg-buildpackage -b -uc -us
To generate the .dsc and the .diff.gz run:
dpkg-source -b <dir>
on the untarred Debianized source tree. Remember to leave the renamed original source tarball of the package in the parent directory.
Don't forget to sign the source package:
debsign -m<maintainer> <name>.dsc
irc.oftc.net, channel #debian-mentors

4.9.2. Ubuntu

4.10. DONE

Renderer (Layout Engine)
- Pango
  - Created and submitted Pango patch - don't implicitly create conjuncts
    - http://bugzilla.gnome.org/show_bug.cgi?id=161981
  - Informed the bengalinux-core team of the implications of the fix to Pango Bug 145233 – Zero-width non-joiner is displayed (http://bugzilla.gnome.org/show_bug.cgi?id=145233)
    - http://www.mail-archive.com/bengalinux-core@lists.sourceforge.net/msg00502.html
  - Created and submitted Pango patch - Enable touching letters in Sinhala rendering
    - http://bugzilla.gnome.org/show_bug.cgi?id=302577
  - Created and submitted Pango patch - Worst case expansion for Sinhala
    - http://bugzilla.gnome.org/show_bug.cgi?id=385321
  - Created and submitted Pango patch - Cursor positioning for Sinhala is broken
    - http://bugzilla.gnome.org/show_bug.cgi?id=451682
  - Created and submitted Pango patch - Remove SF_MPRE_FIXUP from Sinhala script flags
    - http://bugzilla.gnome.org/show_bug.cgi?id=536017
- ICU
  - Provided test files and images for ICU Sinhala support
    - http://bugs.icu-project.org/trac/ticket/4298
  - Convinced Pango and ICU maintainers to emit ZWJ to the font lookup stage
    - ICU: ZWJ Processing in Sinhala / Implement PR 37 ZWJ/ZWNJ Behavior
      - http://bugs.icu-project.org/trac/ticket/4710
      - http://bugs.icu-project.org/trac/ticket/4711
  - Created and submitted ICU patch - Indic Reordering State Table Allows ZWJ Virama ZWJ
    - http://bugs.icu-project.org/trac/ticket/5057
  - Ported Pango patches, which add Sinhala support, to ICU for immediate use in ICU 3.4
    - http://www.redhat.com/archives/fedora-cvs-commits/2006-May/msg00126.html
  - Created and submitted ICU patch - SF_MPRE_FIXUP causing conjuncts to be split in Sinhala
    - http://bugs.icu-project.org/trac/ticket/6232
- Open Office
  - Discovered Open Office was filtering ZWJ
    - http://mail.lug.lk/lurker/message/20060410.130454.19cefb01.en.html
  - Convinced Open Office developers to stop filtering ZWJ and ZWNJ - ZWJ: The zero width joiner shouldn't be filtered out
    - http://qa.openoffice.org/issues/show_bug.cgi?id=68047
  - Reported to Ubuntu that Open Office was filtering ZWJ and ZWNJ - Incorrect Bengali rendering of ra+japhala
    - https://launchpad.net/distros/debian/+source/icu/+bug/35085
  - Reported Open Office bug to Debian - The zero width joiner shouldn't be filtered out
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=403275
- Created and submitted Epiphany patch - Add si (Sinhala) to the list of locales requiring Pango
  - http://bugzilla.gnome.org/show_bug.cgi?id=361538
- Created and submitted Ubuntu Firefox patch - Add si (Sinhala) to the list of locales requiring Pango
  - https://launchpad.net/distros/ubuntu/+source/firefox/+bug/66270/
Input Methods
- Created and submitted XKB keyboard layout to X Keyboard Configuration Database
- Created and submitted vim keyboard layout to Bram
- Created and submitted XKB keyboard layout to xorg
  - https://bugs.freedesktop.org/show_bug.cgi?id=1850
- Created and submitted XKB keyboard layout to xfree86
  - http://bugs.xfree86.org/show_bug.cgi?id=1509
- Tested and provided feedback on the m17n Wijesekera input method
- Updated XKB keyboard layout
  - https://bugs.freedesktop.org/show_bug.cgi?id=11284
- Developed Sinhala Phonetic keyboard layouts and a Transliteration scheme
- Implemented phonetic and transliteration input methods for m17n
- Ensured surrounding text support was added to IBus
  - https://bugzilla.redhat.com/show_bug.cgi?id=435880
- Added Sinhala to /usr/include/X11/keysymdef.h
Fonts
- Discovered that the printing problem was due to the font being an OTF font. Once the LKLUG font was changed to a TTF, the printing problem disappeared.
  - http://www.lug.lk/lurker/message/20050810.094347.8539b8d4.en.html
- Added a glyph for “Kunddaliya” to LKLUG font
- Reorganised glyphs containing “Repaya” and added corresponding lookups to LKLUG font
- Created and submitted fontconfig patch - fontconfig: fix Sinhala coverage
  - http://bugs.freedesktop.org/show_bug.cgi?id=19288
- Requested the removal of incomplete Sinhala glyphs
  - http://lists.gnu.org/archive/html/freefont-bugs/2009-02/msg00000.html
- Improved the range and correctness of the Unicode Sinhala section in the FreeFont
  - https://savannah.gnu.org/forum/forum.php?forum_id=6518
Standards
- Amended ISO639 to include 'Sinhala' in the languages list alongside 'Sinhalese'
  - http://www.loc.gov/standards/iso639-2/php/code_list.php
- Reported errors in the SCCII - Part 1: Collation Sequence (SLS1134)
  - http://sourceforge.net/mailarchive/message.php?msg_id=919347
Collation
- Created and submitted MySQL patch - Add Sinhala script (Sri Lanka) collation to MySQL
  - http://bugs.mysql.com/bug.php?id=26474
- Created and submitted glibc patch - Sinhala (si) collation order undefined
  - http://sourceware.org/bugzilla/show_bug.cgi?id=6968

4.11. TODO

Software should use the word “Sinhala”, not “Sinhalese”
Software should use the word “Sinhala” to refer to the language. Unfortunately, some software projects use “Sinhalese”, “Singhalese” or some other variant.
If you come across FOSS or non-FOSS software that does not use the word “Sinhala” to refer to the language, please file a bug against it or send an email to its maintainer(s).
ISO 639 is the reference used by most software projects. “Sinhala” is the preferred word for the language as per ISO 639 and the Sri Lankan constitution:
- http://www.loc.gov/standards/iso639-2/php/code_list.php
  sin si Sinhala; Sinhalese
- http://www.constitution.gov.lk/downloads/Chapter%20IV%20-%20Language.pdf
  18. 3[(1)] The Official Language of Sri Lanka shall be Sinhala.
“Sinhala” is the preferred word for the script as per ISO 15924:
- http://www.unicode.org/iso15924/iso15924-codes.html
  Sinh 348 Sinhala singhalais Sinhala 2004-05-01
Furthermore, if it is a FOSS project, you can point to the GNOME and KDE projects as examples where the word “Sinhala” is used:
- http://l10n.gnome.org/languages/si/
  Sinhala Translation Team
- http://l10n.kde.org/team-infos.php?teamcode=si
  Sinhala Team (si)
This is an easy way to help out, so if you get a chance, please do your bit. Good luck!
Renderer (Layout Engine)
- ccmp feature for Indic Languages...
  - http://bugs.icu-project.org/trac/ticket/7601
- Support touching letters in the Qt renderer
Input Methods
- See if XKB can be extended to allow multiple codepoints per keycode
Fonts
- Learn about OT features/order
- Develop a standard lookup table for font developers
Sorting
- String matching - a consonant followed by dependent vowel 'o' should not match the same consonant followed by dependent vowel 'oo'
- Can the DUCET be updated with the correct Sinhala collation sequence?
- MySQL Sinhala locale needed
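
The string-matching item in the Sorting list above can be illustrated with a short Python sketch. It relies on one assumption about how the problem arises: in canonically decomposed (NFD) text, the dependent vowel 'oo' (U+0DDD) decomposes to the sequence for 'o' (U+0DDC) followed by U+0DCA, so a plain substring search for consonant + 'o' also hits consonant + 'oo'. The helper name find_whole is hypothetical, and rejecting matches that are followed by a combining mark is only one possible approach.

```python
import unicodedata

KO = "\u0d9a\u0ddc"   # ka + dependent vowel 'o'
KOO = "\u0d9a\u0ddd"  # ka + dependent vowel 'oo'


def find_whole(text, needle):
    """Find needle in text without splitting a combining sequence.

    Both strings are normalized to NFD first. A candidate match is
    rejected if the next character is a combining mark (general
    category M*), because the match would then cover only part of a
    syllable. Returns an index into the NFD form of text, or -1.
    """
    text = unicodedata.normalize("NFD", text)
    needle = unicodedata.normalize("NFD", needle)
    start = 0
    while True:
        index = text.find(needle, start)
        if index == -1:
            return -1
        end = index + len(needle)
        if end == len(text) or not unicodedata.category(text[end]).startswith("M"):
            return index
        start = index + 1
```

With these definitions, a naive test such as unicodedata.normalize("NFD", KO) in unicodedata.normalize("NFD", KOO) succeeds, while find_whole(KOO, KO) reports no match.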
Other GNU/Linux Infrastructure
- OTF printing problem
Printing
- Firefox & Thunderbird can't print Sinhala
Misc
- Submit corrections to Unicode, e.g. aae vs. aee
- UTF-8 should be declared the standard file encoding
- Develop Sinhala IPA transliteration for documents
- Develop Sinhala literary transliteration for documents
- English Locale for Sri Lanka

[1]	http://marc.theaimsgroup.com/?t=106354110900001&r=1&w=2
[2]	http://bugzilla.gnome.org/show_bug.cgi?id=153517
[3]	http://bugzilla.gnome.org/show_bug.cgi?id=161981
[4]	http://bugzilla.gnome.org/show_bug.cgi?id=302577
[5]	http://bugs.icu-project.org/trac/ticket/4298
[6]	http://bugs.icu-project.org/trac/ticket/4711
[7]	http://bugs.icu-project.org/trac/ticket/5057
[8]	http://bugs.icu-project.org/trac/ticket/5501
[9]	http://download.fedora.redhat.com/pub/fedora/linux/core/3/i386/os/RELEASE-NOTES-en.html
[10]	http://cvs.fedora.redhat.com/viewcvs/checkout/rpms/firefox/devel/firefox.sh.in?rev=1.8
[11]	http://cvs.fedora.redhat.com/viewcvs/checkout/rpms/firefox/devel/firefox.sh.in?rev=1.11
[12]	http://mail.lug.lk/lurker/message/20060410.130454.19cefb01.en.html
[13]	https://bugzilla.redhat.com/show_bug.cgi?id=200728
[14]	http://www.openoffice.org/issues/show_bug.cgi?id=68047
[15]	http://code.google.com/p/ibus/
[16]	http://www.m17n.org/
[17]	http://sourceforge.net/mailarchive/message.php?msg_id=919364
[18]	http://lists.mysql.com/internals/34303