SDX

SDX and multilingualism

SDX is a tool for searching and displaying XML documents that pays particular attention to language issues, which is why it is presented as multilingual. This is a broad concept, so we will describe both what it means for SDX and which multilingual services SDX offers. Before getting there, one important fact must be recalled:

SDX is a platform for building multilingual search applications; this does not mean that SDX already includes all the tools needed to correctly process every language a developer may wish to support.

What is a language for SDX?

Any discussion of multilingualism must first define what a language is in SDX-2, especially because SDX is built on two technologies that each have their own notion of a language (Java and XML). The good news is that the two notions do not conflict.

For Java, a language is actually a "locale" (see the java.util.Locale class for a full definition), which combines a language code (ISO-639), a country code (ISO-3166) and an optional "variant" that allows several options to be defined for a language.

For XML, section 2.12 of the standard states what a language is and how it can be represented. XML includes a specific attribute (xml:lang) to identify a language; its value must be one defined by IETF RFC-1766. That standard specifies a language as a combination of a language code and a country code, following standards similar to those used by Java, although this combination can be replaced by a language code registered with IANA.

From both definitions, it was decided that, for SDX-2, a language is a combination of an ISO-639 language code, an optional ISO-3166 country code and an optional variant. To represent these languages in an XML environment, the xml:lang attribute is used for the first two parts, and a "variant" attribute is used for the third. This makes it possible to write, for example:

  <sdx:field xml:lang="fr-CA" variant="qc".../>
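As an illustration of how the two attributes map onto the Java notion of a locale, here is a minimal sketch. The FieldLanguage class and its toLocale method are hypothetical names introduced for this example; they are not part of the SDX API, and SDX itself may build its locales differently.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of the SDX API. */
final class FieldLanguage {

    /**
     * Builds a java.util.Locale from an xml:lang value such as "fr-CA"
     * and an optional variant such as "qc".
     */
    static Locale toLocale(String xmlLang, String variant) {
        // xml:lang carries the first two parts: language, then optional country.
        String[] parts = xmlLang.split("-", 2);
        String language = parts[0];
        String country = parts.length > 1 ? parts[1] : "";
        // The third part comes from the separate "variant" attribute and may be absent.
        return new Locale(language, country, variant == null ? "" : variant);
    }

    public static void main(String[] args) {
        Locale locale = toLocale("fr-CA", "qc");
        System.out.println(locale.getLanguage()); // fr
        System.out.println(locale.getCountry());  // CA
        System.out.println(locale.getVariant());  // qc
    }
}
```

The three-argument Locale constructor is used here because it accepts a short variant such as "qc" without validation.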

However, for the interface language, which is covered in the following section, the concept of variant is never used; only a language code and a country code are used, both identified by an xml:lang attribute.

Language of interfaces

Web applications built with SDX usually generate dynamic Web pages with the XSP language, and an XSP page usually includes an sdx:page element, which gives access to SDX-specific services. Whatever it contains, this element has an important effect: the (virtual) XML document generated by the XSP page will have sdx:document as its root element, and that element will carry an xml:lang attribute identifying the language desired for the interface. For example, we may find:

<sdx:document xml:lang="fr-CA">
  ...
  <sdx:results ...>
    ...
  </sdx:results>
</sdx:document>

This attribute is always present and should be used as the basis for building multilingual interfaces. In an XSLT transformation applied to this virtual document to produce an HTML page (this is only an example), the information can be used to generate different content depending on the language:

...
<xsl:variable name="l" select="string(/sdx:document/@xml:lang)"/>
...
<xsl:choose>
  <xsl:when test="starts-with($l, 'fr')">Bonjour</xsl:when>
  <xsl:when test="starts-with($l, 'en')">Hello</xsl:when>
  ...
</xsl:choose>
...

This feature is useful (though hardly revolutionary), but it raises a key question: how does SDX determine the language written in this attribute? SDX looks for this information in several locations, in a specific order, and writes the first language it finds into the attribute. The following sections document these locations and the order in which they are checked.

Where is the language?

The language of the application

In the configuration file (application.xconf), the top-level sdx:application element must have an xml:lang attribute. SDX can use the value of this attribute to determine the display language. This value can be considered the default language of the application; for a single-language application, this attribute is usually enough to indicate the language.

The user language

Web applications built with SDX rely on user services when serving Web pages. Even when no formal identification takes place, SDX considers a specific user, the anonymous user. No language is associated with that user, but this is the exception: all other users have an associated language, which can be used to determine the display language.

In the SDX interface for user creation, a language can be specified for a user. Applications that provide their own user management interface can associate a language with a user in their own way; SDX will still use that language to determine the display language.

The language of the XSP page

A language can be specified directly in an XSP page, and SDX can use it as the display language. The language must be specified with an xml:lang attribute on the sdx:page element or on its parent element xsp:page. For example:

<xsp:page>
  <sdx:page xml:lang="es">
  ...
  </sdx:page>
</xsp:page>

Dynamic language parameter

The SDX API uses a generic concept of parameters and offers several ways of specifying them in an XSP page. Further details can be found in the documentation on parameters; here we illustrate their use for the language.

SDX parameters must have a name; for the language, the parameter name is lang. SDX offers five ways to specify the value of a parameter, which are, in order of priority: a non-null Java variable defined in the XSP page; a value taken from an HTTP request parameter (and thus from the URL); a value stored in a session object on the server; an explicit value in the XSP page; and finally a default value in SDX. For the language there is no default value in SDX, but this does not matter, since the application language (defined in application.xconf) acts as the default. Note also that using an explicit parameter value has no advantage over using the xml:lang attribute in the XSP page, as described above.

The parameter must be attached to the sdx:page element. The following examples show the various ways to do this:

<!-- Explicit value -->
<sdx:page lang="fr">

<!-- Value which comes from the session object whose key is "language" -->
<sdx:page langSession="language">

<!-- Value which comes from an HTTP parameter "l" (for example the test.xsp?l=fr URL) -->
<sdx:page langParam="l">

<!-- Value which comes from a Java variable which has been previously defined -->
<xsp:logic>
  String l = "fr";
</xsp:logic>
<sdx:page langString="l">

These examples all use attributes to specify the parameter. SDX also offers a second method, which uses sdx:parameter sub-elements. The third example above can therefore be replaced by the following code:

<sdx:page>
  <sdx:parameter name="lang" valueParam="l"/>

This parameter mechanism is very flexible, and it is essential for building truly dynamic, multilingual Web applications. For example, an l parameter carrying the language value can be added to every link of the interface, with the langParam="l" attribute set in the XSP page, so that the display language always has a suitable value.

However, this approach has a drawback: the developer must always include the l parameter in hypertext links, which is error-prone. As a solution, the value can be stored in a session object, which SDX can then use. All that is needed is the following in the XSP pages:

<sdx:page langSession="l" langParam="l">

Specifying both is not a problem; on the contrary, SDX defines a priority: the URL parameter (langParam attribute) takes priority over the session object, but that priority only applies if the parameter is actually present in the URL. Moreover, if the URL contains the parameter, SDX uses it and also stores its value in the session object, because the langSession attribute has the same value as the langParam attribute.

This feature of SDX parameters is probably the one that offers the most flexibility and ease of use for language management. To illustrate, assume that the above code belongs to an XSP page called index.xsp, which is called successively with various parameter values, and consider how the display language is determined:

  1. index.xsp: the language is taken from the session object whose key is l. If it does not exist, SDX looks for the language elsewhere.

  2. index.xsp?l=en: the language will be en.

  3. index.xsp: language does not change, it is still en.

  4. index.xsp?l=fr: the language changes to fr.

This approach lets the developer specify a URL parameter only when the display language needs to change. It is also possible to add a langString attribute, which behaves like langParam but takes its value from a Java variable and has an even higher priority. This makes it possible, if necessary, to choose the language with more precise logic than simple URL parameter values allow.
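The four-request scenario above can be modelled with a short sketch. This is a simplified simulation of the behaviour described for a page declared with langSession="l" and langParam="l", not SDX code: the LangParameter class, its resolve method and the in-memory map standing in for the session object are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of langParam="l" combined with langSession="l"; not SDX code. */
final class LangParameter {

    /** Stands in for the server-side session object. */
    private final Map<String, String> session = new HashMap<>();

    /**
     * Resolves the "lang" parameter for one request.
     *
     * @param urlValue value of the URL parameter "l", or null if absent
     * @return the language, or null if it must be looked for elsewhere
     */
    String resolve(String urlValue) {
        if (urlValue != null) {
            // The URL parameter wins and is remembered for later requests.
            session.put("l", urlValue);
            return urlValue;
        }
        // No URL parameter: fall back on the session value, if any.
        return session.get("l");
    }

    public static void main(String[] args) {
        LangParameter page = new LangParameter();
        System.out.println(page.resolve(null)); // null (nothing in the session yet)
        System.out.println(page.resolve("en")); // en
        System.out.println(page.resolve(null)); // en
        System.out.println(page.resolve("fr")); // fr
    }
}
```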

Search order

SDX can look for language information in dynamic parameters, explicitly in the XSP page, in the current user's information and in the application configuration. Since all these locations can contain language information, a priority must be defined. Dynamic information should logically come first; a second rule is that information contained in the XSP page takes priority over the remaining sources.

Consequently, the priority order is:

  1. Dynamic parameters

  2. Language of the XSP page

  3. The user language

  4. The application language (application.xconf)
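The priority order above amounts to taking the first source that actually provides a language. A minimal sketch, assuming each source is represented by a string that is null when the source defines no language (the DisplayLanguage class and its resolve method are hypothetical names, not part of SDX):

```java
/** Hypothetical illustration of the search order; not part of the SDX API. */
final class DisplayLanguage {

    /** Returns the first non-null language, checked in priority order. */
    static String resolve(String dynamicParameter, String xspPage,
                          String user, String application) {
        String[] candidates = {dynamicParameter, xspPage, user, application};
        for (String candidate : candidates) {
            if (candidate != null) {
                return candidate;
            }
        }
        return null; // no source defines a language
    }

    public static void main(String[] args) {
        // No dynamic parameter and no page language: the user language wins.
        System.out.println(resolve(null, null, "en", "fr")); // en
        // Anonymous user: only the application language remains.
        System.out.println(resolve(null, null, null, "fr")); // fr
    }
}
```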

Language and search functionalities

The other important aspect of multilingualism in SDX is its relationship with search. Any application that processes text is affected by the language of the texts it handles. In document retrieval, this dependency shows up in three ways:

  1. When indexing and searching, the text must be split into words, and this split depends on the language of the text.

  2. When indexing and searching, it may be necessary to transform the text, for example by converting it to lower case or removing diacritics. Here again, the transformation depends on the language of the text.

  3. When ordering search results by the textual information contained in field values. Ordering textual content also depends on the language of the text.

SDX gives the application developer very precise control over language-dependent operations; this control mainly applies at the level of the fields of the document base. When defining those fields, it is possible to set a language for them and, optionally, a specific word analyser. It is important to note that the discussion below is specific to SDX's use of the Lucene engine; if SDX later supports additional search engines, this information may not apply to them.

The word analyser

When SDX indexes textual content in a field of type word, the content is first transformed by a word analyser. Each individual word is stored in the index, and these words can be transformed, for example to convert the whole text to lower case and remove diacritics. For example, when indexing the textual content Première personne de la soirée (First person in the party), an analyser could provide three words to index: premiere, personne and soiree. The analyser is also in charge of removing stop words. For the system to work, the same transformation must be applied, with the same word analyser, when executing a search request. If a user searches for Soirée, the word analyser provides the word soiree, which matches the document containing the example content.
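To make the transformation concrete, here is a minimal sketch of what such an analysis step does: split into words, convert to lower case, remove diacritics and drop stop words. It uses only the standard Java library and is not an SDX or Lucene analyser; the SimpleFrenchAnalysis class and its four-word stop list are assumptions made for the example. Accented characters in the sample strings are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: not an SDX or Lucene analyser. */
final class SimpleFrenchAnalysis {

    /** Tiny stop list chosen for this example. */
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("le", "la", "les", "de"));

    static List<String> analyse(String text) {
        List<String> words = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String lower = token.toLowerCase(Locale.FRENCH);
            // Decompose accented letters, then drop the combining marks.
            String plain = Normalizer.normalize(lower, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");
            if (!STOP_WORDS.contains(plain)) {
                words.add(plain);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "Premiere personne de la soiree", with accents on the first and last words
        System.out.println(analyse("Premi\u00e8re personne de la soir\u00e9e"));
        // prints [premiere, personne, soiree]
        System.out.println(analyse("Soir\u00e9e")); // prints [soiree]
    }
}
```

Because the indexed text and the query go through the same method, the query term and the indexed term end up identical.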

It would be difficult to design a word analyser that works well for all languages, since word identification and transformation usually depend on the language of the text. That is why SDX allows the application developer to use a different word analyser for each field of the document base. These analysers can either be included in SDX or provided by the developer; a word analyser is simply a Java class that extends (directly or indirectly) the fr.gouv.culture.sdx.search.lucene.analysis.Analyzer class. The following word analysers are included in SDX 2:

Additional analysers may be included in SDX; all contributions are welcome.

To make word analysers more flexible, or to allow the same analyser to be used in slightly different ways, SDX includes an analyser configuration mechanism. Configuration is done through an XML document containing various kinds of information, some specific to one analyser and some that apply, with possible exceptions, to all analysers. A configuration file for each language is included in SDX, in sdx/resources/conf/analysis under the SDX installation directory (for example .../webapps/sdx). Here is part of the configuration file for French:

<?xml version="1.0" encoding="ISO-8859-1"?>
<french useStopWords="true" keepAccents="false">
  <stopWords>
    <stopWord>le</stopWord>
    <stopWord>la</stopWord>
    <stopWord>les</stopWord>
    ...
  </stopWords>
</french>

The stopWord sub-elements contain the stop words, and the keepAccents attribute defines whether or not the analyser removes diacritics. Removing diacritics is therefore an optional feature of the word analyser for French.

Every field must have a word analyser. This analyser can be specified directly, specified indirectly, be the default analyser of the document base, or be the SDX default analyser. We will examine how each of these is specified.

Specifying a word analyser directly

Two attributes allow the word analyser to be specified directly: analyzerClass and analyzerConf. On an sdx:field element, they identify the word analyser of that field; on the sdx:fieldList element, they identify the default word analyser for the document base.

The analyzerClass attribute names the class to be used as the analyser; as described above, this class must extend the fr.gouv.culture.sdx.search.lucene.analysis.Analyzer class, and it must obviously be on the Java CLASSPATH.

The analyzerConf attribute gives the location of the configuration file for the word analyser. Its value is a URL, either absolute or relative to the application.xconf file. If this attribute is not defined, the analyser's default configuration is used. This attribute can also be used to specify a configuration file when the word analyser is specified indirectly.

Specifying a word analyser indirectly

A word analyser may also be specified indirectly, that is, by stating the language of the field content or of all fields of the document base. If the xml:lang and variant attributes are used on sdx:field (for a particular field) or sdx:fieldList (as the default for all fields of the document base), SDX will try to find a suitable word analyser for that language.

The method SDX uses to find this analyser is best shown with an example. Assume the following definition:

<sdx:field ... xml:lang="fr-CA" variant="qc"/>

The algorithm is:

  1. If a class fr.gouv.culture.sdx.search.lucene.analysis.Analyzer_fr_CA_qc exists and is available, it is used; otherwise, go to the next step.

  2. If a class fr.gouv.culture.sdx.search.lucene.analysis.Analyzer_fr_CA exists and is available, it is used; otherwise, go to the next step.

  3. If a class fr.gouv.culture.sdx.search.lucene.analysis.Analyzer_fr exists and is available, it is used; otherwise, go to the next step.

  4. The default class fr.gouv.culture.sdx.search.lucene.analysis.Analyzer is used.
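The lookup steps above can be sketched as follows. This is an illustration of the fallback idea rather than the actual SDX implementation: the AnalyzerLookup class and its methods are hypothetical, and the handling of a variant given without a country code (it is simply ignored here) is an assumption, since the text does not cover that case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the fallback lookup; not the SDX implementation. */
final class AnalyzerLookup {

    static final String BASE = "fr.gouv.culture.sdx.search.lucene.analysis.Analyzer";

    /** Lists candidate class names from the most specific to the default. */
    static List<String> candidates(String language, String country, String variant) {
        boolean hasCountry = country != null && !country.isEmpty();
        boolean hasVariant = variant != null && !variant.isEmpty();
        List<String> names = new ArrayList<>();
        if (hasCountry && hasVariant) {
            names.add(BASE + "_" + language + "_" + country + "_" + variant);
        }
        if (hasCountry) {
            names.add(BASE + "_" + language + "_" + country);
        }
        names.add(BASE + "_" + language);
        names.add(BASE);
        return names;
    }

    /** Returns the first name that can be loaded, or null if none can. */
    static String firstAvailable(List<String> names) {
        for (String name : names) {
            try {
                Class.forName(name);
                return name;
            } catch (ClassNotFoundException | LinkageError e) {
                // Not available on the classpath: try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String name : candidates("fr", "CA", "qc")) {
            System.out.println(name);
        }
    }
}
```

Run as shown, the main method prints the four class names of the numbered steps, in the same order.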

If an xml:lang attribute is used together with an analyzerClass attribute, the latter is used to find the word analyser: the direct method takes priority over the indirect one.

SDX default analyser

When no analyser is specified for a field or for a document base, SDX uses the fr.gouv.culture.sdx.search.lucene.analysis.DefaultAnalyzer analyser, which is a word analyser for the English language. This analyser is also used if a field indirectly declares an analyser that is not available with SDX.

Ordering results

Search results returned by SDX can be ordered by any field defined in the document base. Lists of the terms of a field are always given in alphabetical order. Both ordering operations performed by SDX try to take the language of the fields into account.

If an sdx:field element has an xml:lang attribute and, possibly, a variant attribute, this information can be used to build a comparator (a Java concept) that orders results correctly for the language. For further information on these comparators and their use for ordering text, see the java.text.Collator Java class.

Java virtual machines usually ship with collators for several languages. Check the documentation of your virtual machine for the list of available languages.
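The effect of a language-aware comparator can be seen with a short sketch based on java.text.Collator. The LocaleOrdering class is a hypothetical name used for this example and says nothing about how SDX wires collators internally. The accented word is written with a Unicode escape so the source compiles regardless of file encoding.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Illustrative only: shows language-aware ordering with java.text.Collator. */
final class LocaleOrdering {

    /** Returns a copy of the values ordered by the collation rules of the locale. */
    static List<String> sort(List<String> values, Locale locale) {
        List<String> sorted = new ArrayList<>(values);
        // Collator is a Comparator, so it can be passed directly to sort().
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("zebre", "\u00e9cole", "abricot");

        // Plain code-point order puts the accented word after "zebre".
        List<String> naive = new ArrayList<>(terms);
        Collections.sort(naive);
        System.out.println(naive);

        // French collation puts it between "abricot" and "zebre".
        System.out.println(sort(terms, Locale.FRENCH));
    }
}
```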



Author: Martin Sévigny (AJLSM) - 2003-06-03