AraMorph


Installation and tests

Building from source code

The source code includes the necessary Ant libraries together with a build.xml file.

Warning
For now, the code makes heavy use of regular expressions, so AraMorph must be built and run on JDK 1.4 or above. The code changes that would allow the use of external regular expression libraries, and therefore an older JDK, are still to be done.
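
The JDK 1.4 requirement follows from the java.util.regex package, which entered the standard library with that release. The following is only a minimal sketch of that API; the class name RegexSketch and the pattern are illustrative assumptions and are not taken from AraMorph's source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of java.util.regex; not part of AraMorph.
final class RegexSketch {
    // Matches runs of characters in the basic Arabic Unicode block (U+0600 to U+06FF).
    private static final Pattern ARABIC_RUN = Pattern.compile("[\\u0600-\\u06FF]+");

    /** Counts the runs of consecutive Arabic characters found in the given text. */
    static int countArabicRuns(String text) {
        Matcher matcher = ARABIC_RUN.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two Arabic words separated by Latin text: prints 2
        System.out.println(countArabicRuns("abc \u0643\u062A\u0627\u0628 def \u0642\u0644\u0645"));
    }
}
```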

To build, invoke build.bat with the name of a target.

Fixme (???)
Many thanks to whoever provides a functional Unix script for the build.

The available targets are:

Target     Action
compile    Compiles the source code. By default, the result is generated into ${dist}/src.
jar        Builds the ArabicAnalyzer.jar file. By default, the result is generated into ${dist}. Depending on the value of the with.sources property (true by default), the source files are included in the jar.
zip        Builds an ArabicAnalyzer-src.zip file containing the source files. By default, the result is generated into ${dist}.
dist       Builds an ArabicAnalyzer-dist.zip file containing all the distribution files. By default, the result is generated into ${dist}.
javadoc    Builds the Javadoc for the source files. By default, the result is generated into ${dist}/javadoc.
site       Builds the HTML documentation using Apache Forrest. By default, the result is generated into ${dist}/html. You may set Forrest's path through the FORREST_HOME environment variable or through the forrest.home property in a ./forrest.properties file.
           Warning: The target is designed for Forrest version 0.5.1. Many thanks to whoever provides a build.xml file for newer versions.
clean      Deletes the build directory (${dist} by default).
help       Displays help about the available targets (default target).
Note
By default, ${dist} is set to ./build.

Testing the morphological analyzer

The morphological analyzer is run as follows:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar ¶
	gpl.pierrick.brihaye.aramorph.AraMorph
Warning
The classpath must obviously point to the actual location of the ArabicAnalyzer.jar and commons-collections.jar files.
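
The commands on this page separate classpath entries with ";", which is the Windows convention; Unix-like systems use ":" instead. Java exposes the platform's separator as File.pathSeparator. The sketch below, with a hypothetical class name, builds the -cp value for the current platform and reports jars that cannot be found. The two jar paths are the ones used in the command above and may differ in your setup.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of AraMorph.
final class ClasspathSketch {
    /** Joins classpath entries with the given separator (";" on Windows, ":" on Unix-like systems). */
    static String join(String separator, String... entries) {
        return String.join(separator, entries);
    }

    /** Returns the entries that do not exist on disk. */
    static List<String> missing(String... entries) {
        List<String> result = new ArrayList<>();
        for (String entry : entries) {
            if (!new File(entry).exists()) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] jars = {"build/ArabicAnalyzer.jar", "lib/commons-collections.jar"};
        // File.pathSeparator is the separator expected by the JVM on this platform.
        System.out.println("-cp " + join(File.pathSeparator, jars));
        for (String jar : missing(jars)) {
            System.err.println("Not found: " + jar);
        }
    }
}
```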

Run this way, i.e. without arguments, the program displays the following help message:

Arabic Morphological Analyzer for Java(tm)
Ported to Java(tm) by Pierrick Brihaye, 2003-2004.
Based on :
BUCKWALTER ARABIC MORPHOLOGICAL ANALYZER
Portions (c) 2002 QAMUS LLC (www.qamus.org),
(c) 2002 Trustees of the University of Pennsylvania.
This program is governed by :
The GNU General Public License

Usage :

araMorph inFile [inEncoding] [outFile] [outEncoding] [-v]

inFile : file to be analyzed
inEncoding : encoding for inFile, default CP1256
outFile : result file, default console
outEncoding : encoding for outFile, if not specified use Buckwalter transliteration with system's file.encoding
-v : verbose mode

	

The parameters are straightforward:

Parameter      Usage
inFile         The path of the text file to be analyzed (mandatory)
inEncoding     The encoding of this text file (CP1256 by default)
outFile        The path of the file where the results of the morphological analysis should be written (the console by default)
outEncoding    The encoding of the results file (by default, the JVM's file.encoding system property, using Buckwalter's transliteration)
-v             A flag to be set for more verbosity
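
When outEncoding is omitted, the results use Buckwalter's transliteration, which writes each Arabic letter as a single ASCII character. The sketch below illustrates the idea with a deliberately partial table of a few consonants. The class name is hypothetical, and this is not AraMorph's own transliteration code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration with a partial table; not AraMorph's implementation.
final class BuckwalterSketch {
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0627', 'A'); // alif
        TABLE.put('\u0628', 'b'); // ba
        TABLE.put('\u062A', 't'); // ta
        TABLE.put('\u0631', 'r'); // ra
        TABLE.put('\u0639', 'E'); // ayn
        TABLE.put('\u0642', 'q'); // qaf
        TABLE.put('\u0643', 'k'); // kaf
        TABLE.put('\u0644', 'l'); // lam
        TABLE.put('\u0645', 'm'); // mim
    }

    /** Replaces each mapped Arabic letter; characters outside the partial table pass through unchanged. */
    static String transliterate(String arabic) {
        StringBuilder out = new StringBuilder(arabic.length());
        for (char c : arabic.toCharArray()) {
            Character mapped = TABLE.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // kaf, ta, alif, ba: prints ktAb
        System.out.println(transliterate("\u0643\u062A\u0627\u0628"));
    }
}
```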

Here are a few usage examples:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar ¶
gpl.pierrick.brihaye.aramorph.AraMorph ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar ¶
gpl.pierrick.brihaye.aramorph.AraMorph ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/UTF-8.txt UTF-8 -v

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar ¶
gpl.pierrick.brihaye.aramorph.AraMorph ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/iso-8859-6.txt ¶
iso-8859-6 results.txt CP1256
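
The usage line and the examples above suggest that the arguments are positional and that -v may follow any of them. The sketch below shows one way such a command line could be interpreted. It is an assumption made for illustration: the class name is hypothetical, and AraMorph's actual parsing code may behave differently, in particular regarding where -v is accepted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; AraMorph's real argument handling may differ.
final class UsageSketch {
    final String inFile;
    final String inEncoding;  // defaults to CP1256
    final String outFile;     // null means the console
    final String outEncoding; // null means Buckwalter transliteration with file.encoding
    final boolean verbose;

    private UsageSketch(List<String> positional, boolean verbose) {
        this.inFile = positional.get(0);
        this.inEncoding = positional.size() > 1 ? positional.get(1) : "CP1256";
        this.outFile = positional.size() > 2 ? positional.get(2) : null;
        this.outEncoding = positional.size() > 3 ? positional.get(3) : null;
        this.verbose = verbose;
    }

    /** Separates the -v flag from the positional arguments (assumed to be accepted at any position). */
    static UsageSketch parse(String... args) {
        List<String> positional = new ArrayList<>();
        boolean verbose = false;
        for (String arg : args) {
            if ("-v".equals(arg)) {
                verbose = true;
            } else {
                positional.add(arg);
            }
        }
        if (positional.isEmpty() || positional.size() > 4) {
            throw new IllegalArgumentException(
                "usage: araMorph inFile [inEncoding] [outFile] [outEncoding] [-v]");
        }
        return new UsageSketch(positional, verbose);
    }

    public static void main(String[] args) {
        UsageSketch u = parse("UTF-8.txt", "UTF-8", "-v");
        System.out.println(u.inFile + " " + u.inEncoding + " verbose=" + u.verbose);
    }
}
```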
		

Testing the Arabic analyzer for Lucene

The parameters are the same, except for -v, which has no use here. The Lucene jar file must, however, be included in the classpath.

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar ¶
gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt results.txt
		
Warning
The classpath must obviously point to the actual location of the ArabicAnalyzer.jar, commons-collections.jar and lucene*.jar files.

Testing the English analyzer for Lucene

Note
Yes, it is possible to analyze an Arabic text and return English tokens!
Don't expect too much, however: AraMorph isn't designed to be an automatic translation tool.

The parameters are the same, except for -v, which has no use here. The Lucene jar file must, however, be included in the classpath.

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar ¶
gpl.pierrick.brihaye.aramorph.test.TestArabicGlosser ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt results.txt
		
Warning
The classpath must obviously point to the actual location of the ArabicAnalyzer.jar, commons-collections.jar and lucene*.jar files.