edu.northwestern.at.utils.corpuslinguistics.stemmer
Class LancasterStemmer

java.lang.Object
  extended by edu.northwestern.at.utils.corpuslinguistics.stemmer.LancasterStemmer
All Implemented Interfaces:
Stemmer

public class LancasterStemmer
extends java.lang.Object
implements Stemmer

LancasterStemmer: Implements the Lancaster (Paice/Husk) word stemmer.

Paice/Husk Stemmer - License Statement.

This software was designed and developed at Lancaster University, Lancaster, UK, under the supervision of Dr Chris Paice. It is fully in the public domain, and may be used or adapted by any organisation or individual. Neither Dr Paice nor Lancaster University accepts any responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.

It is assumed that, as a matter of professional courtesy, anyone who incorporates this software into a system of their own, whether for commercial or research purposes, will acknowledge the source of the code.

Modified from the original Java programs written by Christopher O'Neill and Rob Hooper for use in WordHoard.


Field Summary
static java.lang.String[] defaultStemmingRules
          Default stemming rules.
static java.lang.String[] prefixes
          Prefixes to remove from words before stemming.
protected  boolean preStrip
           
protected  java.util.Vector ruleTable
           
protected  int[] ruleTableIndex
           
protected static char zeroDigit
          Character for "0" digit.
 
Constructor Summary
LancasterStemmer()
          Create a Paice/Husk stemmer using the default stemming rules.
LancasterStemmer(java.lang.String[] rules)
          Create a Paice/Husk stemmer from a string list of rules.
LancasterStemmer(java.lang.String[] rules, boolean preStrip)
          Create a Paice/Husk stemmer from a string list of rules, optionally stripping prefixes.
 
Method Summary
protected  int charCode(char ch)
          Converts a lower case letter to an index.
protected  java.lang.String clean(java.lang.String s)
          Remove non-letters from a string.
protected  int firstVowel(java.lang.String s, int last)
          Returns index of first vowel in string.
protected  boolean isDigit(char ch)
          Determine if character is a digit.
protected  boolean isLetter(char ch)
          Determine if character is a letter.
protected  boolean isVowel(char ch)
          Determine if character is a vowel or not.
protected  void loadRules(java.lang.String[] rules)
          Loads the stemming rules.
 java.lang.String stem(java.lang.String s)
          Stem a specified string.
protected  java.lang.String stripPrefixes(java.lang.String s)
          Removes prefixes from a string.
protected  java.lang.String stripSuffixes(java.lang.String s)
          Strip suffixes from a string.
protected  boolean vowel(char ch, char prev)
          Determine if character is a vowel, taking the previous character into account.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

prefixes

public static final java.lang.String[] prefixes
Prefixes to remove from words before stemming.


defaultStemmingRules

public static final java.lang.String[] defaultStemmingRules
Default stemming rules.

These rules MUST be stored in ascending alphanumeric order of the first character.
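The ordering requirement suggests that the rules are looked up through a per-letter index such as ruleTableIndex, although this page does not describe the lookup itself. The following is a minimal sketch of that idea, not the WordHoard implementation. The class name RuleIndexSketch, the methods buildIndex and isSortedByFirstChar, and the sample rule strings are all hypothetical and are used only for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only; not the WordHoard implementation.
final class RuleIndexSketch {

    // Hypothetical sample rules, sorted by first character.
    // The real rule syntax is not described on this page.
    static final String[] SAMPLE_RULES = {
        "ai*2.", "a*1.", "bb1.", "city3s.", "ci2>", "dd1."
    };

    // Map 'a'..'z' to 0..25, as charCode is documented to do.
    static int charCode(char ch) {
        return ch - 'a';
    }

    // For each letter, record the position of the first rule that
    // starts with it, or -1 when no rule starts with that letter.
    static int[] buildIndex(String[] rules) {
        int[] index = new int[26];
        Arrays.fill(index, -1);
        for (int i = 0; i < rules.length; i++) {
            int code = charCode(rules[i].charAt(0));
            if (index[code] < 0) {
                index[code] = i;
            }
        }
        return index;
    }

    // All rules for a letter form one contiguous run only when the
    // array is in ascending order of first character.
    static boolean isSortedByFirstChar(String[] rules) {
        for (int i = 1; i < rules.length; i++) {
            if (rules[i - 1].charAt(0) > rules[i].charAt(0)) {
                return false;
            }
        }
        return true;
    }
}
```

With a sorted array, a lookup can jump to the first rule for a letter and scan forward until the first character changes. An unsorted array would scatter a letter's rules, and such a scan would miss some of them.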


zeroDigit

protected static final char zeroDigit
Character for "0" digit.

See Also:
Constant Field Values

ruleTable

protected java.util.Vector ruleTable

ruleTableIndex

protected int[] ruleTableIndex

preStrip

protected boolean preStrip
Constructor Detail

LancasterStemmer

public LancasterStemmer()
Create a Paice/Husk stemmer using the default stemming rules.

Throws:
StemmerException - if something goes wrong.

Prefixes are automatically removed from words with more than two characters.


LancasterStemmer

public LancasterStemmer(java.lang.String[] rules)
Create a Paice/Husk stemmer from a string list of rules.

Parameters:
rules - The stemming rules as an array of String.

Prefixes are automatically removed from words with more than two characters.


LancasterStemmer

public LancasterStemmer(java.lang.String[] rules,
                        boolean preStrip)
Create a Paice/Husk stemmer from a string list of rules, optionally stripping prefixes.

Parameters:
rules - The stemming rules as an array of String.
preStrip - True to remove prefixes from words with more than two characters.

Prefixes are removed from words with more than two characters only when preStrip is true.

Method Detail

loadRules

protected void loadRules(java.lang.String[] rules)
Loads the stemming rules.

Parameters:
rules - String array of rules.

firstVowel

protected int firstVowel(java.lang.String s,
                         int last)
Returns index of first vowel in string.

Parameters:
s - String to search for vowel.
last - Last position to search for vowel.
Returns:
Zero-based index of first vowel in string.

stripSuffixes

protected java.lang.String stripSuffixes(java.lang.String s)
Strip suffixes from a string.

Parameters:
s - The string from which to remove suffixes.
Returns:
The string with suffixes removed.

isVowel

protected boolean isVowel(char ch)
Determine if character is a vowel or not.

Parameters:
ch - The potential vowel.
Returns:
true if the character is a vowel (a, e, i, o, u).

vowel

protected boolean vowel(char ch,
                        char prev)
Determine if character is a vowel, taking the previous character into account.

Parameters:
ch - The potential vowel.
prev - The previous character.
Returns:
true if the character is a vowel.

When the character is a "y", the previous character is checked to see if it is a vowel. If so, "y" is not considered a vowel; otherwise it is.
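The "y" rule can be sketched as follows. This is a minimal illustration of the documented behaviour, not the library's source; the class name VowelSketch is hypothetical, and the methods are written as static only so the sketch stands alone.

```java
// Illustrative sketch only; not the WordHoard implementation.
final class VowelSketch {

    // Plain vowel test: a, e, i, o, u.
    static boolean isVowel(char ch) {
        return "aeiou".indexOf(ch) >= 0;
    }

    // Context-sensitive test: "y" counts as a vowel only when the
    // previous character is not itself a vowel.
    static boolean vowel(char ch, char prev) {
        if (ch == 'y') {
            return !isVowel(prev);
        }
        return isVowel(ch);
    }
}
```

Under this sketch the "y" in "cry" is treated as a vowel because it follows "r", while the "y" in "say" is not because it follows "a".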


isDigit

protected boolean isDigit(char ch)
Determine if character is a digit.

Parameters:
ch - The character to check.
Returns:
true if "ch" is a digit ('0' .. '9').

isLetter

protected boolean isLetter(char ch)
Determine if character is a letter.

Parameters:
ch - The character to check.
Returns:
true if "ch" is a letter ('a' .. 'z').

charCode

protected int charCode(char ch)
Converts a lower case letter to an index.

Parameters:
ch - The character. Must be in the range 'a' .. 'z'.
Returns:
The index, where 'a' = 0 .

stripPrefixes

protected java.lang.String stripPrefixes(java.lang.String s)
Removes prefixes from a string.

Parameters:
s - The string from which to remove prefixes.
Returns:
The string with prefixes removed.
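A rough sketch of how prefix stripping might combine with the preStrip flag is shown below. Several details are assumptions that this page does not confirm: the sample prefixes, stripping at most one prefix, and leaving a word unchanged when it consists of the prefix alone. The names PrefixSketch, SAMPLE_PREFIXES and preprocess are hypothetical.

```java
// Illustrative sketch only; not the WordHoard implementation.
final class PrefixSketch {

    // Hypothetical sample prefixes; the contents of the real
    // "prefixes" field are not listed on this page.
    static final String[] SAMPLE_PREFIXES = { "kilo", "micro" };

    // Assumption: remove the first matching prefix, but only when
    // something remains of the word afterwards.
    static String stripPrefixes(String s) {
        for (String prefix : SAMPLE_PREFIXES) {
            if (s.startsWith(prefix) && s.length() > prefix.length()) {
                return s.substring(prefix.length());
            }
        }
        return s;
    }

    // Apply stripping only when preStrip is set and the word has
    // more than two characters, as the constructor notes describe.
    static String preprocess(String s, boolean preStrip) {
        if (preStrip && s.length() > 2) {
            return stripPrefixes(s);
        }
        return s;
    }
}
```

In this sketch, preprocess("microscope", true) yields "scope", while the same call with preStrip set to false returns the word unchanged.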

clean

protected java.lang.String clean(java.lang.String s)
Remove non-letters from a string.

Parameters:
s - String from which to remove non-letters.
Returns:
String with non-letters removed.
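Since isLetter is documented as accepting only 'a' .. 'z', removing non-letters could look like the sketch below. It is an illustration under that assumption, not the library's source. In particular, whether the input is converted to lower case before cleaning is not stated on this page, so the sketch simply drops upper case characters. The class name CleanSketch is hypothetical.

```java
// Illustrative sketch only; not the WordHoard implementation.
final class CleanSketch {

    // Letter test as documented for isLetter: 'a' .. 'z' only.
    static boolean isLetter(char ch) {
        return ch >= 'a' && ch <= 'z';
    }

    // Keep only the characters that pass isLetter, in order.
    static String clean(String s) {
        StringBuilder kept = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (isLetter(ch)) {
                kept.append(ch);
            }
        }
        return kept.toString();
    }
}
```

Under this sketch, clean("don't") gives "dont". A caller working with mixed-case text would probably want to lower-case the input first, since "Hello" would otherwise become "ello".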

stem

public java.lang.String stem(java.lang.String s)
Stem a specified string.

Specified by:
stem in interface Stemmer
Parameters:
s - The string to stem.
Returns:
The stemmed string.