fr.gouv.culture.sdx.search.lucene.analysis
Class Glosser_ar_en

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by fr.gouv.culture.sdx.search.lucene.analysis.AbstractAnalyzer
          extended by fr.gouv.culture.sdx.search.lucene.analysis.Glosser_ar_en
All Implemented Interfaces:
Analyzer, java.io.Serializable, org.apache.avalon.framework.configuration.Configurable, org.apache.avalon.framework.logger.LogEnabled, org.apache.excalibur.xml.sax.XMLizable

public final class Glosser_ar_en
extends AbstractAnalyzer

An English glosser for the Arabic language. This glosser uses Tim Buckwalter's algorithm (available from the LDC Catalog) to identify the morphological category of Arabic tokens and then returns their glosses. The set of meaningful morphological categories is still to be determined, but the current list gives good results.
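The glossing step described above can be pictured with a minimal, self-contained sketch. It is not the SDX or Buckwalter implementation: the class name GlossSketch, the toy lexicon, the category labels and the choice of "meaningful" categories are all assumptions made for illustration. Tokens are given in a Buckwalter-style transliteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: none of these names come from the SDX or Buckwalter code.
class GlossSketch {

    // Hypothetical lexicon entry: a morphological category and an English gloss.
    static final class Entry {
        final String category;
        final String gloss;

        Entry(String category, String gloss) {
            this.category = category;
            this.gloss = gloss;
        }
    }

    // Toy lexicon keyed by transliterated token (assumed sample data).
    static final Map<String, Entry> LEXICON = new HashMap<>();
    static {
        LEXICON.put("ktAb", new Entry("NOUN", "book"));
        LEXICON.put("byt", new Entry("NOUN", "house"));
        LEXICON.put("fy", new Entry("PREP", "in"));
        LEXICON.put("kbyr", new Entry("ADJ", "big"));
    }

    // Categories treated as semantically meaningful in this sketch (an assumption).
    static final Set<String> MEANINGFUL =
            new HashSet<>(Arrays.asList("NOUN", "ADJ", "VERB"));

    // Returns the glosses of the tokens whose category is in MEANINGFUL.
    // Unknown tokens and tokens of other categories are dropped.
    static List<String> gloss(List<String> tokens) {
        List<String> glosses = new ArrayList<>();
        for (String token : tokens) {
            Entry entry = LEXICON.get(token);
            if (entry != null && MEANINGFUL.contains(entry.category)) {
                glosses.add(entry.gloss);
            }
        }
        return glosses;
    }

    public static void main(String[] args) {
        // The preposition "fy" is filtered out; prints [book, house]
        System.out.println(gloss(Arrays.asList("ktAb", "fy", "byt")));
    }
}
```

In this sketch the category filter is what keeps function words out of the index; the real list of categories used by the glosser is not documented on this page.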

Author:
Pierrick Brihaye, 2003
See Also:
Serialized Form

Field Summary
protected static java.lang.String ANALYZER_TYPE
           
static java.lang.String[] STOP_WORDS
          An array containing some common English words that are usually not useful for searching.
 
Fields inherited from class fr.gouv.culture.sdx.search.lucene.analysis.AbstractAnalyzer
logger
 
Constructor Summary
Glosser_ar_en()
           
 
Method Summary
 void configure(org.apache.avalon.framework.configuration.Configuration configuration)
          Configure the glosser.
 void enableLogging(org.apache.avalon.framework.logger.Logger logger)
          Passes the logger returned by super.getLog() to this class.
protected  java.lang.String getAnalyzerType()
           
 org.apache.lucene.analysis.TokenStream tokenStream(java.io.Reader reader)
          Deprecated. Use tokenStream(String, Reader) instead.
 org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
          Returns a token stream of glosses of Arabic words whose morphological categories are found to be semantically meaningful.
 
Methods inherited from class fr.gouv.culture.sdx.search.lucene.analysis.AbstractAnalyzer
toSAX
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
getPositionIncrementGap
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ANALYZER_TYPE

protected static final java.lang.String ANALYZER_TYPE
See Also:
Constant Field Values

STOP_WORDS

public static final java.lang.String[] STOP_WORDS
An array containing some common English words that are usually not useful for searching.
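As an illustration of how such a stop-word array is typically applied, the following sketch removes stop words from an English gloss. The sample array, the class name StopWordSketch and the method removeStopWords are hypothetical; the actual contents of STOP_WORDS are not listed on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: not part of the SDX API.
class StopWordSketch {

    // Hypothetical sample; the real STOP_WORDS array may differ.
    static final String[] SAMPLE_STOP_WORDS = {"a", "an", "the", "of", "to", "in"};

    // Splits a gloss on whitespace, lower-cases each word and drops stop words.
    static List<String> removeStopWords(String gloss, String[] stopWords) {
        Set<String> stop = new HashSet<>(Arrays.asList(stopWords));
        List<String> kept = new ArrayList<>();
        for (String word : gloss.trim().split("\\s+")) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !stop.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [house, teacher]
        System.out.println(removeStopWords("the house of the teacher", SAMPLE_STOP_WORDS));
    }
}
```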

Constructor Detail

Glosser_ar_en

public Glosser_ar_en()
Method Detail

getAnalyzerType

protected java.lang.String getAnalyzerType()
Specified by:
getAnalyzerType in class AbstractAnalyzer

configure

public void configure(org.apache.avalon.framework.configuration.Configuration configuration)
               throws org.apache.avalon.framework.configuration.ConfigurationException
Configure the glosser.

Specified by:
configure in interface org.apache.avalon.framework.configuration.Configurable
Overrides:
configure in class AbstractAnalyzer
Parameters:
configuration - The configuration object
Throws:
org.apache.avalon.framework.configuration.ConfigurationException - If a problem occurs during configuration

enableLogging

public void enableLogging(org.apache.avalon.framework.logger.Logger logger)
Passes the logger returned by super.getLog() to this class.

Specified by:
enableLogging in interface org.apache.avalon.framework.logger.LogEnabled
Overrides:
enableLogging in class AbstractAnalyzer
Parameters:
logger - The logger returned by super.getLog()

tokenStream

public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName,
                                                          java.io.Reader reader)
Returns a token stream of glosses of Arabic words whose morphological categories are found to be semantically meaningful.

Specified by:
tokenStream in interface Analyzer
Specified by:
tokenStream in class org.apache.lucene.analysis.Analyzer
Parameters:
fieldName - The name of the field
reader - The reader
Returns:
The token stream

tokenStream

public org.apache.lucene.analysis.TokenStream tokenStream(java.io.Reader reader)
Deprecated. Use tokenStream(String, Reader) instead.

Creates a TokenStream which tokenizes all the text in the provided Reader. Provided for backward compatibility only.

See Also:
Analyzer.tokenStream(java.io.Reader)
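A common way to keep such a backward-compatible overload is to have it delegate to the two-argument form. The sketch below shows that pattern with plain JDK types; whether Glosser_ar_en is implemented this way is not stated here. The class OverloadSketch, the DEFAULT_FIELD placeholder and the use of a List of strings in place of a TokenStream are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a List<String> stands in for a Lucene TokenStream.
class OverloadSketch {

    // Hypothetical field name used when the caller does not supply one.
    static final String DEFAULT_FIELD = "";

    // Preferred form: the field name is available to the analysis step.
    // This sketch ignores fieldName and simply splits on whitespace.
    static List<String> tokens(String fieldName, Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        for (String token : text.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Legacy form, kept for backward compatibility: delegates to the form above.
    @Deprecated
    static List<String> tokens(Reader reader) {
        return tokens(DEFAULT_FIELD, reader);
    }

    public static void main(String[] args) {
        // New code should call the two-argument form directly.
        System.out.println(tokens("content", new StringReader("ktAb byt")));
    }
}
```

Callers migrating away from the deprecated method only need to supply a field name; in this sketch both forms produce the same tokens.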


Copyright © 2000-2010 Ministere de la culture et de la communication / AJLSM. All Rights Reserved.