gpl.pierrick.brihaye.aramorph.lucene
Class ArabicTokenizer

java.lang.Object
  extended by org.apache.lucene.analysis.TokenStream
      extended by org.apache.lucene.analysis.Tokenizer
          extended by gpl.pierrick.brihaye.aramorph.lucene.ArabicTokenizer

public class ArabicTokenizer
extends org.apache.lucene.analysis.Tokenizer

A tokenizer that returns tokens in the Arabic alphabet. This tokenizer is rather crude: it also filters out digits and punctuation, even within an Arabic part of the stream. I plan to write a "universal", highly configurable character tokenizer.

Author:
Pierrick Brihaye, 2003

Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
ArabicTokenizer(java.io.Reader input)
          Constructs a tokenizer that returns tokens in the Arabic alphabet.
ArabicTokenizer(java.io.Reader input, boolean debug)
          Constructs a tokenizer that returns tokens in the Arabic alphabet, optionally printing debug messages.
 
Method Summary
protected  boolean isArabicChar(char c)
          Tests whether a character is in the Arabic alphabet.
 org.apache.lucene.analysis.Token next()
          Returns the next token in the stream, or null at end of stream.
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArabicTokenizer

public ArabicTokenizer(java.io.Reader input)
Constructs a tokenizer that returns tokens in the Arabic alphabet.

Parameters:
input - The reader supplying the characters to tokenize

ArabicTokenizer

public ArabicTokenizer(java.io.Reader input,
                       boolean debug)
Constructs a tokenizer that returns tokens in the Arabic alphabet.

Parameters:
input - The reader supplying the characters to tokenize
debug - Whether or not the tokenizer should print convenience messages to System.out

Method Detail

isArabicChar

protected boolean isArabicChar(char c)
Tests whether a character is in the Arabic alphabet.

Parameters:
c - The character to test
Returns:
true if the character is in the Arabic alphabet, false otherwise
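
The exact character ranges accepted by isArabicChar are not documented here. As a rough illustration only, the following sketch shows one way such a test could be written with the JDK alone. The class ArabicCharSketch and the method isArabicLetter are hypothetical names, and the rule used (a letter in the basic Arabic Unicode block, U+0600 to U+06FF) is an assumption, not necessarily what ArabicTokenizer does. In particular, this sketch rejects diacritics, which are combining marks rather than letters; the real method may treat them differently.

```java
public class ArabicCharSketch {

    // Hypothetical stand-in for ArabicTokenizer.isArabicChar(char).
    // Assumption: "Arabic alphabet" means a letter in the basic Arabic
    // Unicode block; digits and punctuation in that block are rejected.
    public static boolean isArabicLetter(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC
                && Character.isLetter(c);
    }

    public static void main(String[] args) {
        System.out.println(isArabicLetter('\u0628')); // ARABIC LETTER BEH: true
        System.out.println(isArabicLetter('\u0663')); // ARABIC-INDIC DIGIT THREE: false
        System.out.println(isArabicLetter('\u060C')); // ARABIC COMMA: false
        System.out.println(isArabicLetter('a'));      // Latin letter: false
    }
}
```

Since the method is protected, a subclass could override it to accept a different character set.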

next

public org.apache.lucene.analysis.Token next()
                                      throws java.io.IOException
Returns the next token in the stream, or null at end of stream.

Returns:
The next token, with its type set to ARABIC, or null at end of stream
Throws:
java.io.IOException - If an I/O error occurs while reading the input
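
To illustrate the behavior described for this class (runs of Arabic characters are returned, while digits, punctuation, and other characters are dropped), here is a minimal JDK-only sketch. It does not use Lucene or the real ArabicTokenizer: the names ArabicRunSketch, isArabicLetter, nextRun, and allRuns are hypothetical, the character test is an assumption, and plain strings stand in for Token objects.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ArabicRunSketch {

    // Hypothetical character test; the real isArabicChar(char) may differ.
    public static boolean isArabicLetter(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC
                && Character.isLetter(c);
    }

    // Returns the next run of Arabic letters, or null at end of stream,
    // mirroring the "token or null" contract documented for next().
    public static String nextRun(Reader input) throws IOException {
        StringBuilder run = new StringBuilder();
        int ch;
        while ((ch = input.read()) != -1) {
            if (isArabicLetter((char) ch)) {
                run.append((char) ch);
            } else if (run.length() > 0) {
                return run.toString(); // a non-Arabic character ends the run
            }
            // non-Arabic characters outside a run are skipped
        }
        return run.length() > 0 ? run.toString() : null;
    }

    // Collects every run in the given text by calling nextRun until it returns null.
    public static List<String> allRuns(String text) {
        List<String> runs = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            String run;
            while ((run = nextRun(reader)) != null) {
                runs.add(run);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Two Arabic words separated by digits, Latin text, and punctuation.
        List<String> runs = allRuns("\u0643\u062A\u0627\u0628 2003, pen: \u0642\u0644\u0645.");
        System.out.println(runs.size()); // 2
    }
}
```

With the real class, the same loop shape applies: call next() repeatedly until it returns null, and close the tokenizer when done.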