gpl.pierrick.brihaye.aramorph.lucene
Class ArabicStemAnalyzer

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by gpl.pierrick.brihaye.aramorph.lucene.ArabicStemAnalyzer

public final class ArabicStemAnalyzer
extends org.apache.lucene.analysis.Analyzer

Analyzer for the Arabic language. This analyzer uses Tim Buckwalter's algorithm (available from the LDC Catalog) to identify the morphological category of Arabic tokens. The list of significant grammatical categories has not been finalized, but the current list gives good results. Final tokens are a romanized version of the canonical word.

Author:
Pierrick Brihaye, 2003

Field Summary
protected  boolean outputBuckwalter
          Whether or not the analyzer should output tokens in the Buckwalter transliteration system
 
Constructor Summary
ArabicStemAnalyzer()
          Constructs an analyzer that will return grammatically significant Arabic tokens in the Buckwalter transliteration system.
ArabicStemAnalyzer(boolean outputBuckwalter)
          Constructs an analyzer that will return grammatically significant Arabic tokens.
 
Method Summary
 org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String FieldName, java.io.Reader reader)
          Returns a token stream of Arabic words whose grammatical categories are found to be significant.
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
tokenStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

outputBuckwalter

protected boolean outputBuckwalter
Whether or not the analyzer should output tokens in the Buckwalter transliteration system
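
For readers unfamiliar with the Buckwalter transliteration system, the following is a minimal, self-contained sketch of the idea: each Arabic letter is mapped to a single ASCII character. It is not AraMorph's own code. The class name BuckwalterSketch and the method transliterate are hypothetical, and the table covers only a handful of letters of the full scheme.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of AraMorph or Lucene.
public class BuckwalterSketch {

    // Partial table: Arabic letter -> Buckwalter ASCII symbol.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0627', 'A'); // alif
        TABLE.put('\u0628', 'b'); // ba
        TABLE.put('\u0629', 'p'); // ta marbuta
        TABLE.put('\u062A', 't'); // ta
        TABLE.put('\u062F', 'd'); // dal
        TABLE.put('\u0631', 'r'); // ra
        TABLE.put('\u0633', 's'); // sin
        TABLE.put('\u0634', '$'); // shin
        TABLE.put('\u0639', 'E'); // ayn
        TABLE.put('\u0642', 'q'); // qaf
        TABLE.put('\u0643', 'k'); // kaf
        TABLE.put('\u0644', 'l'); // lam
        TABLE.put('\u0645', 'm'); // mim
        TABLE.put('\u0646', 'n'); // nun
        TABLE.put('\u0647', 'h'); // ha
        TABLE.put('\u0648', 'w'); // waw
        TABLE.put('\u064A', 'y'); // ya
    }

    // Replaces every character found in the table; anything else is copied unchanged.
    static String transliterate(String arabic) {
        StringBuilder out = new StringBuilder(arabic.length());
        for (char c : arabic.toCharArray()) {
            Character mapped = TABLE.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The word for "book": kaf, ta, alif, ba.
        System.out.println(transliterate("\u0643\u062A\u0627\u0628")); // prints "ktAb"
    }
}
```

Because the mapping is one character to one character, a transliterated token can be stored and searched as plain ASCII, which is presumably why the analyzer offers it as an output form.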

Constructor Detail

ArabicStemAnalyzer

public ArabicStemAnalyzer()
Constructs an analyzer that will return grammatically significant Arabic tokens in the Buckwalter transliteration system.


ArabicStemAnalyzer

public ArabicStemAnalyzer(boolean outputBuckwalter)
Constructs an analyzer that will return grammatically significant Arabic tokens.

Parameters:
outputBuckwalter - Whether or not the tokens should be transliterated

Method Detail

tokenStream

public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String FieldName,
                                                          java.io.Reader reader)
Returns a token stream of Arabic words whose grammatical categories are found to be significant.

Parameters:
FieldName - The name of the field
reader - The reader supplying the text to analyze
Returns:
The token stream