AraMorph


Technical principles of the morphological analysis

The morphological analyzer

Let's create a src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt file in which we will type a single word, كتاب.

Then let's run the following command:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar ¶
gpl.pierrick.brihaye.aramorph.AraMorph ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt ¶
CP1256 results.txt UTF-8 -v
	
Warning
The input encoding given on the command line (CP1256 here) must match the encoding in which your text editor saved the file.
Warning
The dictionaries have a sizable memory footprint. You may have to increase the memory allocated to Java with options such as -Xms128M -Xmx192M.

Let's have a look at the output file, results.txt, which is encoded in UTF-8:

Processing token : 	كتاب
Transliteration : 	ktAb
Token not yet processed.
Token has direct solutions.

SOLUTION #3
Lemma  : 	kAtib
Vocalized as : 	كُتّاب
Morphology : 
	prefix : Pref-0
	stem : N
	suffix : Suff-0
Grammatical category : 
	stem : كُتّاب	NOUN
Glossed as : 
	stem : authors/writers


SOLUTION #1
Lemma  : 	kitAb
Vocalized as : 	كِتاب
Morphology : 
	prefix : Pref-0
	stem : Ndu
	suffix : Suff-0
Grammatical category : 
	stem : كِتاب	NOUN
Glossed as : 
	stem : book


SOLUTION #2
Lemma  : 	kut~Ab
Vocalized as : 	كُتّاب
Morphology : 
	prefix : Pref-0
	stem : N
	suffix : Suff-0
Grammatical category : 
	stem : كُتّاب	NOUN
Glossed as : 
	stem : kuttab (village school)/Quran school

	

This output makes the way the morphological analyzer works clearer:

Message | Meaning
Processing token | the word being processed
Transliteration | the transliteration of the word in Buckwalter's transliteration system; only with the -v parameter and if no output encoding is specified
Token not yet processed. | indicates that the word has not been processed yet and is not in AraMorph's cache; only with the -v parameter
Token has direct solutions. | indicates that the word can be analyzed as it is written; only with the -v parameter. AraMorph is also able to take alternative spellings into account, such as a final ـه in place of a ـة or a final ـى in place of a ـي
SOLUTION | introduces each solution for the word. The display order is not significant.
Lemma | indicates the lemma's ID in the stems dictionary
Vocalized as : | indicates the vocalization of the solution
Morphology : | indicates the morphological category of the prefix, the stem and the suffix of the solution
Grammatical category : | indicates the grammatical category of the prefix, the stem and the suffix of the solution
Glossed as : | indicates one or more English glosses for the prefix, the stem and the suffix of the solution
Note
The morphological categories are explained in this section; the grammatical categories are explained in this section.

How does AraMorph manage to propose acceptable solutions?

First, you have to know that AraMorph, like its predecessor in Perl, works with a transliteration of the Arabic word. This transliteration uses Buckwalter's transliteration system. Thus, كتاب is transliterated as ktAb before its morphological analysis.
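
As an illustration, the transliteration step can be sketched as a character-by-character table lookup. This is only a minimal sketch, not AraMorph's actual code: the class name BuckwalterSketch and the method transliterate are hypothetical, and the table holds only the few Buckwalter mappings needed for the words on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: NOT AraMorph's own transliteration class.
class BuckwalterSketch {
    // Partial Buckwalter table: only the characters used on this page.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0627', 'A'); // alef
        TABLE.put('\u0628', 'b'); // beh
        TABLE.put('\u062A', 't'); // teh
        TABLE.put('\u0643', 'k'); // kaf
        TABLE.put('\u064E', 'a'); // fatha
        TABLE.put('\u064F', 'u'); // damma
        TABLE.put('\u0650', 'i'); // kasra
        TABLE.put('\u0651', '~'); // shadda
    }

    // Replaces each known Arabic character by its Buckwalter equivalent;
    // characters missing from the partial table are copied unchanged.
    static String transliterate(String arabic) {
        StringBuilder out = new StringBuilder(arabic.length());
        for (char c : arabic.toCharArray()) {
            Character mapped = TABLE.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // kaf, teh, alef, beh: the word used throughout this page
        System.out.println(transliterate("\u0643\u062A\u0627\u0628")); // prints ktAb
    }
}
```

The same lookup applied to the vocalized form كُتّاب gives kut~Ab, which is the lemma ID shown for solution #2 above.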

Fixme (PB)
This operation should not be necessary since Java works natively with Unicode. A code optimization that would bypass the transliteration step, and thus increase performance, remains to be done.

Then, AraMorph uses a brute-force algorithm to decompose the word into every possible sequence of prefix, stem and suffix:

prefix | stem | suffix
ktAb | Ø | Ø
ktA | b | Ø
ktA | Ø | b
kt | Ab | Ø
kt | A | b
kt | Ø | Ab
k | tAb | Ø
k | tA | b
k | t | Ab
k | Ø | tAb
Ø | ktAb | Ø
Ø | ktA | b
Ø | kt | Ab
Ø | k | tAb
Ø | Ø | ktAb
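
The enumeration above can be reproduced with two nested loops over the prefix and suffix lengths. The following is a minimal sketch under stated assumptions, not AraMorph's actual implementation: the names SegmenterSketch and decompose are hypothetical, and any limit the real analyzer may place on prefix, stem or suffix length is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT AraMorph's own segmentation code.
class SegmenterSketch {
    // Returns every {prefix, stem, suffix} split of the word.
    // An empty string stands for the empty element (shown as a slashed O in the table).
    static List<String[]> decompose(String word) {
        List<String[]> result = new ArrayList<>();
        int n = word.length();
        // Longest prefix first, then shortest suffix first, as in the table.
        for (int p = n; p >= 0; p--) {
            for (int s = 0; s <= n - p; s++) {
                String prefix = word.substring(0, p);
                String stem = word.substring(p, n - s);
                String suffix = word.substring(n - s);
                result.add(new String[] {prefix, stem, suffix});
            }
        }
        return result;
    }

    // Displays the empty element with the symbol used in the table.
    static String show(String part) {
        return part.isEmpty() ? "\u00D8" : part;
    }

    public static void main(String[] args) {
        for (String[] parts : decompose("ktAb")) {
            System.out.println(show(parts[0]) + " | " + show(parts[1]) + " | " + show(parts[2]));
        }
    }
}
```

For the four-letter word ktAb, this yields 15 decompositions, one per row of the table.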

Then, AraMorph checks whether each element is present in the corresponding one of three dictionaries:

  • the prefix, in gpl/pierrick/brihaye/aramorph/dictionaries/dictPrefixes
  • the stem, in gpl/pierrick/brihaye/aramorph/dictionaries/dictStems
  • the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/dictSuffixes

If successful, AraMorph grabs the morphological information for each element.

Warning
The Ø prefixes and suffixes are morphologically significant.
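
The lookup step, including the role of the empty elements, can be illustrated as follows. This is a sketch only: the in-memory maps are hypothetical stand-ins for the dictionary files, the assumption that stems are keyed by their unvocalized transliteration is ours, and the few entries are taken from the sample output above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: toy dictionaries, NOT the real dictionary files or their format.
class LookupSketch {
    // Each map associates a transliterated form with its morphological categories.
    static final Map<String, List<String>> PREFIXES = new HashMap<>();
    static final Map<String, List<String>> STEMS = new HashMap<>();
    static final Map<String, List<String>> SUFFIXES = new HashMap<>();
    static {
        // The empty prefix and suffix have their own entries and categories.
        PREFIXES.put("", Arrays.asList("Pref-0"));
        SUFFIXES.put("", Arrays.asList("Suff-0"));
        // Categories seen in the sample output for the stem ktAb.
        STEMS.put("ktAb", Arrays.asList("Ndu", "N"));
    }

    // A decomposition survives this step only if all three elements have an entry.
    static boolean hasEntries(String prefix, String stem, String suffix) {
        return PREFIXES.containsKey(prefix)
            && STEMS.containsKey(stem)
            && SUFFIXES.containsKey(suffix);
    }

    public static void main(String[] args) {
        System.out.println(hasEntries("", "ktAb", ""));  // prints true
        System.out.println(hasEntries("k", "tAb", ""));  // prints false with these toy dictionaries
    }
}
```

Without the entries for the empty string, the decomposition Ø + ktAb + Ø would be rejected at this step, so no solution could be found for a bare word.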

If applicable, AraMorph then checks whether the morphological categories of the elements are compatible with one another by looking them up in three tables of valid combinations:

  • between the prefix and the stem, in gpl/pierrick/brihaye/aramorph/dictionaries/tableAB
  • between the prefix and the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/tableAC
  • between the stem and the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/tableBC

A word decomposition whose:

  1. prefix, stem and suffix each have a dictionary entry,
  2. prefix, stem and suffix are morphologically compatible with one another,

... is a solution. For كتاب, there are three, as we can see above.
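
The compatibility check can be sketched as three set lookups on pairs of morphological categories. The code below is an illustration under assumptions, not AraMorph's implementation: the class and method names are hypothetical, the "category category" pair format is assumed, and the toy tables contain only the pairs implied by the three solutions shown above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: toy tables, NOT the real tableAB, tableAC and tableBC files.
class CompatibilitySketch {
    // Valid pairs, written as "category category" (assumed format).
    static final Set<String> TABLE_AB = new HashSet<>(Arrays.asList("Pref-0 N", "Pref-0 Ndu"));
    static final Set<String> TABLE_AC = new HashSet<>(Arrays.asList("Pref-0 Suff-0"));
    static final Set<String> TABLE_BC = new HashSet<>(Arrays.asList("N Suff-0", "Ndu Suff-0"));

    // A combination is valid only if all three pairs are listed.
    static boolean compatible(String prefixCat, String stemCat, String suffixCat) {
        return TABLE_AB.contains(prefixCat + " " + stemCat)
            && TABLE_AC.contains(prefixCat + " " + suffixCat)
            && TABLE_BC.contains(stemCat + " " + suffixCat);
    }

    // Counts the stem entries that are compatible with the given prefix and suffix categories.
    static int countSolutions(String prefixCat, String[][] stemEntries, String suffixCat) {
        int solutions = 0;
        for (String[] entry : stemEntries) {
            if (compatible(prefixCat, entry[1], suffixCat)) {
                solutions++;
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        // {lemma ID, morphological category} pairs from the sample output for ktAb.
        String[][] stemEntries = { {"kitAb", "Ndu"}, {"kut~Ab", "N"}, {"kAtib", "N"} };
        System.out.println(countSolutions("Pref-0", stemEntries, "Suff-0")); // prints 3
    }
}
```

With these toy tables, the three stem entries for ktAb all pass the check, which matches the three solutions in the output.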

Warning
Some information in the stems dictionary is in fact relevant to prefixes or suffixes. When it returns a solution, AraMorph tries to shift this information towards the prefixes or the suffixes. It is thus possible to have several prefixes and/or suffixes for a single word.
If an interpretation problem occurs, which is rare, messages are displayed on the console.