AraMorph


Using with Lucene

The arabic analyzer for Lucene

Let's execute the following command:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt CP1256 results.txt UTF-8
Warning
Of course, the input file's encoding should be correctly configured.
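The last two command-line arguments give the input file's encoding (CP1256) and the output file's encoding (UTF-8). The sketch below, which is purely illustrative and not part of AraMorph, shows why getting the input charset right matters: decoding the same bytes with the wrong charset silently mangles every Arabic character.

```java
import java.io.*;
import java.nio.charset.Charset;

// Illustrative sketch (not AraMorph code): decoding bytes with an explicit
// charset instead of the platform default, as the encoding arguments require.
public class CharsetRoundTrip {

    // Decode raw bytes with an explicitly named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String kitab = "\u0643\u062A\u0627\u0628"; // the word كتاب
        // CP1256 is known to Java as "windows-1256".
        byte[] cp1256Bytes = kitab.getBytes(Charset.forName("windows-1256"));
        // Correct charset: the round-trip preserves the text.
        System.out.println(decode(cp1256Bytes, "windows-1256").equals(kitab)); // true
        // Wrong charset (e.g. ISO-8859-1): the Arabic letters are mangled.
        System.out.println(decode(cp1256Bytes, "ISO-8859-1").equals(kitab)); // false
    }
}
```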

Let's have a look at the results.txt output file:

كِتاب	NOUN	[0-4]	1
كُتّاب	NOUN	[0-4]	0	

The principle is thus the following: every matching stem returns a Lucene token. Each token carries its text (termText), its grammatical category (type), its position in the input stream (startOffset and endOffset), and its position relative to the previous token (positionIncrement).
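The role of positionIncrement can be sketched as follows. This is an illustrative class, not part of AraMorph or Lucene: an increment of 1 starts a new position, while an increment of 0 stacks the token on the previous one, which is how the two analyses of كتاب above end up at the same position.

```java
import java.util.*;

// Illustrative sketch of positionIncrement semantics: an increment of 1
// starts a new position, an increment of 0 stacks the token on the previous
// position. Neither the class nor the record is AraMorph or Lucene code.
public class PositionDemo {

    record Tok(String text, String type, int posIncr) {}

    // Compute each token's absolute position from the increments.
    public static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posIncr();
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two analyses of كتاب from results.txt, in Buckwalter transliteration.
        List<Tok> toks = List.of(
            new Tok("kitAb", "NOUN", 1),   // first analysis: new position
            new Tok("kut~Ab", "NOUN", 0)); // second analysis: same position
        System.out.println(positions(toks)); // [0, 0]
    }
}
```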

It should indeed be pointed out that the same Arabic word, because it is generally reduced to its consonantal skeleton, may often yield several solutions.

Fixme (PB)
Should a text be vocalized, the vowels are unfortunately not taken into account to disambiguate the analysis. This problem remains to be solved.
Note
In this example, the tokens' text is in Arabic but, for performance reasons, it is better to return the tokens in the Buckwalter transliteration system by not specifying an output encoding for the results file.
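The Buckwalter scheme maps each Arabic character to a single ASCII character. The sketch below shows only a handful of the mappings (the real scheme covers the whole Arabic character set) and is not AraMorph code; it is just enough to see why the test file for كتاب is called ktAb.txt.

```java
import java.util.*;

// Partial, illustrative sketch of the Buckwalter transliteration: each Arabic
// character maps to one ASCII character. Only five mappings are shown here;
// the real scheme covers all Arabic letters and diacritics.
public class BuckwalterSketch {

    private static final Map<Character, Character> MAP = Map.of(
        '\u0643', 'k',  // ك kaf
        '\u062A', 't',  // ت ta
        '\u0627', 'A',  // ا alef
        '\u0628', 'b',  // ب ba
        '\u0646', 'n'); // ن nun

    public static String transliterate(String arabic) {
        StringBuilder sb = new StringBuilder();
        for (char c : arabic.toCharArray())
            sb.append(MAP.getOrDefault(c, c)); // pass unmapped characters through
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u0643\u062A\u0627\u0628"));       // ktAb (كتاب)
        System.out.println(transliterate("\u062A\u0643\u062A\u0628\u0646")); // tktbn (تكتبن)
    }
}
```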
Warning
In this example, although كتاب has 3 solutions, we only get 2 tokens. Indeed, كُتّاب appears twice: once as the plural of كاتب, and once as a singular noun meaning Quran school.
Fixme (PB)
What should be done when the tokens have different grammatical categories? Since the Lucene index loses the type information, searches on such tokens may be inaccurate.

Let's try another example with a tktbn.txt file containing the word تكتبن:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.AraMorph src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/tktbn.txt CP1256 results.txt UTF-8

Let's have a look at the results.txt output file:

Processing token : 	تكتبن

SOLUTION #3
Lemma  : 	>akotab
Vocalized as : 	tukotibna
Morphology : 
	prefix : IVPref-Antn-tu
	stem : IV_yu
	suffix : IVSuff-n
Grammatical category : 
	prefix : tu	IV2FP
	stem : kotib	VERB_IMPERFECT
	suffix : na	IVSUFF_SUBJ:FP
Glossed as : 
	prefix : you [fem.pl.]
	stem : dictate/make write
	suffix : [fem.pl.]


SOLUTION #4
Lemma  : 	>akotab
Vocalized as : 	tukotabna
Morphology : 
	prefix : IVPref-Antn-tu
	stem : IV_Pass_yu
	suffix : IVSuff-n
Grammatical category : 
	prefix : tu	IV2FP
	stem : kotab	VERB_IMPERFECT
	suffix : na	IVSUFF_SUBJ:FP
Glossed as : 
	prefix : you [fem.pl.]
	stem : be dictated
	suffix : [fem.pl.]


SOLUTION #2
Lemma  : 	katab
Vocalized as : 	tukotabna
Morphology : 
	prefix : IVPref-Antn-tu
	stem : IV_Pass_yu
	suffix : IVSuff-n
Grammatical category : 
	prefix : tu	IV2FP
	stem : kotab	VERB_IMPERFECT
	suffix : na	IVSUFF_SUBJ:FP
Glossed as : 
	prefix : you [fem.pl.]
	stem : be written/be fated/be destined
	suffix : [fem.pl.]


SOLUTION #1
Lemma  : 	katab
Vocalized as : 	takotubna
Morphology : 
	prefix : IVPref-Antn-ta
	stem : IV
	suffix : IVSuff-n
Grammatical category : 
	prefix : ta	IV2FP
	stem : kotub	VERB_IMPERFECT
	suffix : na	IVSUFF_SUBJ:FP
Glossed as : 
	prefix : you [fem.pl.]
	stem : write
	suffix : [fem.pl.]	

Here, the decomposition into prefix(es), stem and suffix(es) becomes obvious.

Reminder
Depending on its grammatical category, AraMorph may have shifted one or more prefixes (to the left) and/or suffixes (to the right) out of the stem. In other words, the grammatical decomposition may differ from the morphological decomposition derived from the dictionaries.
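Whatever the decomposition, the vocalized form of each solution is simply the concatenation of the vocalized prefix, stem and suffix. This trivial, illustrative sketch checks that against solutions #3 and #1 above:

```java
// Illustrative check (not AraMorph code): the vocalized form of a solution is
// the concatenation of its vocalized prefix, stem and suffix.
public class DecompositionDemo {

    public static String recompose(String prefix, String stem, String suffix) {
        return prefix + stem + suffix;
    }

    public static void main(String[] args) {
        // Solution #3 of تكتبن: prefix tu + stem kotib + suffix na
        System.out.println(recompose("tu", "kotib", "na")); // tukotibna
        // Solution #1: prefix ta + stem kotub + suffix na
        System.out.println(recompose("ta", "kotub", "na")); // takotubna
    }
}
```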

To see that the analysis only deals with the stem, let's run:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/tktbn.txt results.txt

Let's have a look at the results.txt output file:

kotub	VERB_IMPERFECT	[0-5]	1
kotab	VERB_IMPERFECT	[0-5]	0
kotib	VERB_IMPERFECT	[0-5]	0

We can indeed see that we get the stems of the different forms of the root كتب when it is used as an imperfective verb. With this system, the analysis of a second person feminine plural imperfective verb is thus the same as for any other imperfective form, whatever the person.

Note
This implementation is open to discussion!

The english glosser for Lucene

Let's execute the following command:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicGlosser src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt CP1256 results.txt UTF-8
Warning
Of course, the input file's encoding should be correctly configured.

Let's have a look at the results.txt output file:

kuttab	NOUN	[0-4]	1
village	NOUN	[0-4]	0
school	NOUN	[0-4]	0
quran	NOUN	[0-4]	0
school	NOUN	[0-4]	0
authors	NOUN	[0-4]	0
writers	NOUN	[0-4]	0
book	NOUN	[0-4]	0

Here, the principle is strictly the same, except that the glosses are themselves tokenized by the WhitespaceFilter before being sent to Lucene's standard processing chain (StandardFilter, LowerCaseFilter and StopFilter).
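What that chain does to a multi-word gloss can be roughly approximated as follows. This is an illustration under simplifying assumptions, not the actual Lucene filter code, and the stop-word set shown is only a tiny sample of Lucene's English stop list:

```java
import java.util.*;

// Rough, illustrative approximation of the gloss processing: split on
// whitespace, lowercase, drop English stop words. The real work is done by
// Lucene's WhitespaceFilter + StandardFilter + LowerCaseFilter + StopFilter;
// the stop-word set here is a tiny sample, not Lucene's actual list.
public class GlossPipeline {

    private static final Set<String> STOP = Set.of("a", "an", "the", "of", "to");

    public static List<String> process(String gloss) {
        List<String> out = new ArrayList<>();
        for (String word : gloss.split("\\s+")) {
            String w = word.toLowerCase(Locale.ROOT);
            if (!w.isEmpty() && !STOP.contains(w))
                out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        // The "Quran school" gloss of كُتّاب becomes two tokens, as in results.txt.
        System.out.println(process("Quran school")); // [quran, school]
        System.out.println(process("the village school")); // [village, school]
    }
}
```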

Note
The token's type is the analyzed Arabic stem's type; it is obvious that the English word's type may be different.
Note
This implementation is open to discussion!

What are the solutions retained by the analyzers and the glossers?

As we have just seen, the analyzers return tokens whose type is the stem's grammatical category.
However, an analyzer will treat some types as belonging to stop words and will not return any token when it encounters them; it filters them out. The list of the grammatical categories considered non-significant is as follows:

  • DEM_PRON_F
  • DEM_PRON_FS
  • DEM_PRON_FD
  • DEM_PRON_MD
  • DEM_PRON_MP
  • DEM_PRON_MS
  • DET
  • INTERROG
  • NO_STEM
  • NUMERIC_COMMA
  • PART
  • PRON_1P
  • PRON_1S
  • PRON_2D
  • PRON_2FP
  • PRON_2FS
  • PRON_2MP
  • PRON_2MS
  • PRON_3D
  • PRON_3FP
  • PRON_3FS
  • PRON_3MP
  • PRON_3MS
  • REL_PRON

Conversely, this is the list of the grammatical categories that should be regarded as significant :

  • ABBREV
  • ADJ
  • ADV
  • NOUN
  • NOUN_PROP
  • VERB_IMPERATIVE
  • VERB_IMPERFECT
  • VERB_PERFECT
  • NO_RESULT
    Warning
This result is kept since experiments tend to show that there is a significant chance that such a word is a foreign word missing from the dictionary. It is of course possible to write a specific Lucene filter to refine the analysis of this type of token.
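The filtering described above boils down to a membership test against the set of significant categories. A minimal sketch (illustrative, not AraMorph's actual filter class), with the set copied from the list above:

```java
import java.util.*;

// Illustrative sketch of the category filtering: a token is kept only if its
// grammatical category appears in the "significant" set. The set below is
// copied from the documentation's list; this is not AraMorph's filter code.
public class CategoryFilter {

    private static final Set<String> SIGNIFICANT = Set.of(
        "ABBREV", "ADJ", "ADV", "NOUN", "NOUN_PROP",
        "VERB_IMPERATIVE", "VERB_IMPERFECT", "VERB_PERFECT", "NO_RESULT");

    public static boolean isSignificant(String category) {
        return SIGNIFICANT.contains(category);
    }

    public static void main(String[] args) {
        System.out.println(isSignificant("NOUN"));     // true: kept
        System.out.println(isSignificant("DET"));      // false: filtered out
        System.out.println(isSignificant("PRON_3MS")); // false: filtered out
    }
}
```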
Reminder
The explanations about the grammatical categories are available in this section.
Note
This implementation is open to discussion!