edu.northwestern.at.utils.swing
Class FileTokenizer

java.lang.Object
  extended by edu.northwestern.at.utils.swing.FileTokenizer
All Implemented Interfaces:
java.util.Iterator

public class FileTokenizer
extends java.lang.Object
implements java.util.Iterator

Tokenizes text from a text file.

A token is defined as text between word separator characters. The separator characters are defined below in the WORD_SEPARATOR_CHARACTERS array. The tokenizer keeps track of the starting and ending position of each token. This is necessary to support find/replace, spell checking, etc.

FileTokenizer implements the Iterator interface, but the optional remove() method is left as a no-op since there is no collection underlying this class.

Example:

    // Tokenize file text and print out the list of words.
    // The constructor can throw IOException and BadLocationException.

    FileTokenizer tokenizer = new FileTokenizer( "myfile.txt" );

    // While there are more characters
    // we haven't looked at ...

    while ( tokenizer.hasNext() )
    {
        // Extract next word in document.
        // next() returns Object, so a cast is needed.

        String word = (String)tokenizer.next();

        // Print out word and its starting and
        // ending positions in the document text.

        System.out.println(
            word + " starts at " + tokenizer.getStartPos() +
            ", ends at " + tokenizer.getEndPos() );
    }
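
The class description above defines a token as the text between separator characters and notes that the start and end positions of each token are tracked. The following is a minimal, self-contained sketch of that idea over an in-memory string. It is not the FileTokenizer source: the class name PositionTokenizerSketch, the separator list, the exclusive end position, and the token-based hasNext() are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch: iterates over the words of a string and records their positions. */
public class PositionTokenizerSketch implements Iterator<String> {
    // Assumed separator list; the real one is WORD_SEPARATOR_CHARACTERS.
    // Single quote and dash are deliberately left out, as documented.
    private static final String SEPARATORS = " \t\r\n.,;:!?()\"";

    private final String text;
    private int currentPos = 0;
    private int startPos = 0;
    private int endPos = 0;

    public PositionTokenizerSketch(String text) {
        this.text = text;
    }

    private static boolean isSeparator(char ch) {
        return SEPARATORS.indexOf(ch) >= 0;
    }

    /** Assumption: true only if a non-separator character remains. */
    @Override
    public boolean hasNext() {
        int pos = currentPos;
        while (pos < text.length() && isSeparator(text.charAt(pos))) {
            pos++;
        }
        return pos < text.length();
    }

    @Override
    public String next() {
        // Skip any separators in front of the next token.
        while (currentPos < text.length() && isSeparator(text.charAt(currentPos))) {
            currentPos++;
        }
        if (currentPos >= text.length()) {
            throw new NoSuchElementException();
        }
        startPos = currentPos;
        // Consume characters until the next separator or the end of the text.
        while (currentPos < text.length() && !isSeparator(text.charAt(currentPos))) {
            currentPos++;
        }
        endPos = currentPos; // Assumption: end is exclusive (one past the last character).
        return text.substring(startPos, endPos);
    }

    /** No-op, since no collection backs the iterator. */
    @Override
    public void remove() {
    }

    public int getStartPos() {
        return startPos;
    }

    public int getEndPos() {
        return endPos;
    }

    public static void main(String[] args) {
        PositionTokenizerSketch t = new PositionTokenizerSketch("Don't stop, now.");
        while (t.hasNext()) {
            String word = t.next();
            System.out.println(word + " starts at " + t.getStartPos()
                + ", ends at " + t.getEndPos());
        }
    }
}
```

Recording the positions alongside each token is what lets a caller map a word back to its location in the text, for example to replace a misspelled word in place.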


Field Summary
protected  int currentPos
          Current position in document.
protected  javax.swing.text.Document document
          The document to tokenize.
protected  int endPos
          Ending position in document.
protected  javax.swing.text.Segment segment
          The current document segment.
protected static java.util.HashMap separatorHashMap
          Hash holds separator characters for quick access.
protected  int startPos
          Starting position in document.
static char[] WORD_SEPARATOR_CHARACTERS
          Characters that separate words.
static java.lang.String WORD_SEPARATOR_CHARACTERS_STRING
           
 
Constructor Summary
FileTokenizer(java.lang.String textFileName)
          Create document tokenizer.
 
Method Summary
protected static void createSeparatorHashMap()
          Creates word separator hash map from list of separator characters.
 int getEndPos()
          Get ending position in document for tokenization.
 int getStartPos()
          Get starting position in document for tokenization.
 boolean hasNext()
          Check if more characters are available in document.
static boolean isSeparator(char ch)
          Checks if a character is a word separator.
 void moveToStartOfWord()
          Move to start of next word if current cursor is in the middle of a word.
 java.lang.Object next()
          Get next token in document.
 void remove()
          Removes last element returned by iterator (does nothing).
 void setPosition(int pos)
          Set position in document for tokenization.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

document

protected javax.swing.text.Document document
The document to tokenize.


segment

protected javax.swing.text.Segment segment
The current document segment.


startPos

protected int startPos
Starting position in document.


endPos

protected int endPos
Ending position in document.


currentPos

protected int currentPos
Current position in document.


WORD_SEPARATOR_CHARACTERS

public static final char[] WORD_SEPARATOR_CHARACTERS
Characters that separate words.

The single quote is not included as a word separator so that contractions can be picked up. It is up to the invoker to remove unwanted single quotes from a token. Likewise a "-" is not considered a separator so that words containing a dash can be worked with.
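
Because single quotes are kept inside tokens, a caller may receive tokens such as 'don't' with surrounding quotation marks still attached. The sketch below shows one way a caller might strip them while keeping interior apostrophes. The class QuoteTrimSketch and the method stripOuterSingleQuotes are hypothetical and not part of this package.

```java
/** Hypothetical caller-side helper; not part of edu.northwestern.at.utils.swing. */
public class QuoteTrimSketch {
    /** Removes leading and trailing single quotes but keeps interior apostrophes. */
    public static String stripOuterSingleQuotes(String token) {
        int start = 0;
        int end = token.length();
        // Drop quotes at the front of the token.
        while (start < end && token.charAt(start) == '\'') {
            start++;
        }
        // Drop quotes at the back of the token.
        while (end > start && token.charAt(end - 1) == '\'') {
            end--;
        }
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(stripOuterSingleQuotes("'don't'"));
        System.out.println(stripOuterSingleQuotes("'well-read'"));
    }
}
```

Note that this simple approach also removes the trailing apostrophe of a plural possessive such as dogs', so a caller that needs to keep those would have to apply a different rule.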


WORD_SEPARATOR_CHARACTERS_STRING

public static final java.lang.String WORD_SEPARATOR_CHARACTERS_STRING
See Also:
Constant Field Values

separatorHashMap

protected static java.util.HashMap separatorHashMap
Hash holds separator characters for quick access.

Constructor Detail

FileTokenizer

public FileTokenizer(java.lang.String textFileName)
              throws java.io.IOException,
                     javax.swing.text.BadLocationException
Create document tokenizer.

Parameters:
textFileName - Name of text file to tokenize.
Throws:
java.io.IOException
javax.swing.text.BadLocationException
Method Detail

isSeparator

public static boolean isSeparator(char ch)
Checks if a character is a word separator.

Parameters:
ch - The character to check.
Returns:
True if the character is a word separator.

Tests whether a character is a separator by checking if the character is a key in the separatorHashMap map. If so, the character is a separator.


createSeparatorHashMap

protected static void createSeparatorHashMap()
Creates word separator hash map from list of separator characters.

The separatorHashMap map uses each separator character as both a key and the key's value.
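
The lookup scheme described above, where each separator character is stored as both key and value and membership is tested by key, can be sketched as follows. The class SeparatorMapSketch, its short separator list, and its use of generics are assumptions for illustration; the actual contents of WORD_SEPARATOR_CHARACTERS are not listed on this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a separator map keyed by character. */
public class SeparatorMapSketch {
    // Assumed separator list; single quote and dash are left out, as documented.
    static final char[] SEPARATORS = { ' ', '\t', '\n', '.', ',', ';', ':', '!', '?' };

    static final Map<Character, Character> separatorMap = new HashMap<>();

    // Build the map once when the class is loaded.
    static {
        createSeparatorMap();
    }

    /** Stores each separator character as both the key and the key's value. */
    static void createSeparatorMap() {
        separatorMap.clear();
        for (char ch : SEPARATORS) {
            separatorMap.put(ch, ch);
        }
    }

    /** A character is a separator exactly when it is a key in the map. */
    public static boolean isSeparator(char ch) {
        return separatorMap.containsKey(ch);
    }

    public static void main(String[] args) {
        System.out.println(isSeparator(','));
        System.out.println(isSeparator('\''));
    }
}
```

Since only the keys are ever consulted, a HashSet of characters would serve equally well; the map form is used here only to mirror the description.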


moveToStartOfWord

public void moveToStartOfWord()
Move to start of next word if current cursor is in the middle of a word.
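
The page does not spell out how the cursor is moved, so the following is only one plausible reading, sketched over a plain string: if the position is strictly inside a word, skip the rest of that word and any separators after it; otherwise leave the position unchanged. The class WordStartSketch, the static method signature, and the separator list are all assumptions.

```java
/** Hypothetical sketch of moving a cursor to the start of the next word. */
public class WordStartSketch {
    // Assumed separator list for illustration only.
    private static final String SEPARATORS = " \t\r\n.,;:!?";

    static boolean isSeparator(char ch) {
        return SEPARATORS.indexOf(ch) >= 0;
    }

    /**
     * Returns pos unchanged unless it lies strictly inside a word. In that case,
     * returns the index of the first character of the following word, or
     * text.length() if no further word exists.
     */
    public static int moveToStartOfWord(String text, int pos) {
        boolean midWord = pos > 0 && pos < text.length()
            && !isSeparator(text.charAt(pos - 1))
            && !isSeparator(text.charAt(pos));
        if (!midWord) {
            return pos;
        }
        // Skip the remainder of the current word.
        while (pos < text.length() && !isSeparator(text.charAt(pos))) {
            pos++;
        }
        // Skip the separators that follow it.
        while (pos < text.length() && isSeparator(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(moveToStartOfWord("hello big world", 2));
    }
}
```

A step like this is useful after setting an arbitrary position, so that the next token returned is a whole word and not a fragment.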


hasNext

public boolean hasNext()
Check if more characters are available in document.

Specified by:
hasNext in interface java.util.Iterator
Returns:
True if more characters are available in document.

next

public java.lang.Object next()
Get next token in document.

Specified by:
next in interface java.util.Iterator
Returns:
Next token in document as a string.

remove

public void remove()
Removes last element returned by iterator (does nothing).

Specified by:
remove in interface java.util.Iterator

getStartPos

public int getStartPos()
Get starting position in document for tokenization.

Returns:
The starting position.

getEndPos

public int getEndPos()
Get ending position in document for tokenization.

Returns:
The ending position.

setPosition

public void setPosition(int pos)
Set position in document for tokenization.

Parameters:
pos - The position.