edu.northwestern.at.utils.swing
Class DocumentTokenizer

java.lang.Object
  extended by edu.northwestern.at.utils.swing.DocumentTokenizer
All Implemented Interfaces:
java.util.Iterator

public class DocumentTokenizer
extends java.lang.Object
implements java.util.Iterator

Tokenizes document text.

A token is defined as text between word separator characters. The separator characters are defined below in the WORD_SEPARATOR_CHARACTERS array. The tokenizer keeps track of the starting and ending position of each token. This is necessary to support find/replace, spell checking, etc.

DocumentTokenizer implements the Iterator interface, but the optional remove() method is left as a no-op since there is no collection underlying this class.

Example:

    // Tokenize a document and print out list of words.
    // Start at position 0 (the beginning of the document).
    // Get the document from a JTextPane.
    Document document = textPane.getDocument();
    DocumentTokenizer tokenizer = new DocumentTokenizer( document , 0 );

    // While there are more characters
    // we haven't looked at ...
    while ( tokenizer.hasNext() )
    {
        // Extract next word in document.
        // next() is declared to return Object, so a cast is required.
        String word = (String)tokenizer.next();

        // Print out word and its starting and
        // ending positions in the document text.
        System.out.println(
            word + " starts at " + tokenizer.getStartPos() +
            ", ends at " + tokenizer.getEndPos() );
    }
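
The position tracking described above can be illustrated without Swing. The following is a minimal sketch over a plain String, not the DocumentTokenizer source. The class name PositionTokenizerSketch, its separator set, the exclusive end position, and the choice to have hasNext() report remaining tokens rather than remaining characters are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration; not the actual DocumentTokenizer implementation.
public class PositionTokenizerSketch implements Iterator<String> {
    // Assumed separator set. The real list is WORD_SEPARATOR_CHARACTERS.
    // As in the documented class, ' and - are not separators.
    private static final String SEPARATORS = " \t\r\n.,;:!?()\"";

    private final String text;
    private int currentPos;
    private int startPos;
    private int endPos;

    public PositionTokenizerSketch(String text, int offset) {
        this.text = text;
        this.currentPos = offset;
        skipSeparators();
    }

    private static boolean isSeparator(char ch) {
        return SEPARATORS.indexOf(ch) >= 0;
    }

    // Advance past any run of separator characters.
    private void skipSeparators() {
        while (currentPos < text.length() && isSeparator(text.charAt(currentPos))) {
            currentPos++;
        }
    }

    @Override
    public boolean hasNext() {
        // Separators are always skipped eagerly, so any remaining
        // character is the start of another token.
        return currentPos < text.length();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        startPos = currentPos;
        while (currentPos < text.length() && !isSeparator(text.charAt(currentPos))) {
            currentPos++;
        }
        endPos = currentPos; // exclusive end (an assumption of this sketch)
        skipSeparators();
        return text.substring(startPos, endPos);
    }

    @Override
    public void remove() {
        // No-op: there is no underlying collection to remove from.
    }

    public int getStartPos() { return startPos; }

    public int getEndPos() { return endPos; }

    public static void main(String[] args) {
        PositionTokenizerSketch t = new PositionTokenizerSketch("Don't stop, world!", 0);
        while (t.hasNext()) {
            String word = t.next();
            System.out.println(word + " [" + t.getStartPos() + ", " + t.getEndPos() + ")");
        }
    }
}
```

With positions recorded per token, a caller can select, highlight, or replace the exact range a word occupies, which is what find/replace and spell checking need.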


Field Summary
protected  int currentPos
          Current position in document.
protected  javax.swing.text.Document document
          The document to tokenize.
protected  int endPos
          Ending position in document.
protected  javax.swing.text.Segment segment
          The current document segment.
protected static java.util.HashMap separatorHashMap
          Hash holds separator characters for quick access.
protected  int startPos
          Starting position in document.
static char[] WORD_SEPARATOR_CHARACTERS
          Characters that separate words.
static java.lang.String WORD_SEPARATOR_CHARACTERS_STRING
           
 
Constructor Summary
DocumentTokenizer(javax.swing.text.Document document, int offset)
          Create document tokenizer.
 
Method Summary
protected static void createSeparatorHashMap()
          Creates word separator hash map from list of separator characters.
 int getEndPos()
          Get ending position in document for tokenization.
 int getStartPos()
          Get starting position in document for tokenization.
 boolean hasNext()
          Check if more characters are available in the document.
static boolean isSeparator(char ch)
          Checks if a character is a word separator.
 void moveToStartOfWord()
          Move to start of next word if current cursor is in the middle of a word.
 java.lang.Object next()
          Get next token in document.
 void remove()
          Removes last element returned by iterator (does nothing).
 void setPosition(int pos)
          Set position in document for tokenization.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

document

protected javax.swing.text.Document document
The document to tokenize.


segment

protected javax.swing.text.Segment segment
The current document segment.
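
How DocumentTokenizer uses this field is not documented here. As general background, a javax.swing.text.Segment is the buffer that Document.getText(int, int, Segment) fills in, which gives access to document characters without building an intermediate String. The sketch below shows only that standard Swing call. The class SegmentSketch and its helper methods are hypothetical names introduced for illustration.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;
import javax.swing.text.Segment;

// Hypothetical illustration of reading document text through a Segment.
public class SegmentSketch {

    // Builds a small in-memory document holding the given text.
    public static Document newDocument(String text) {
        PlainDocument doc = new PlainDocument();
        try {
            doc.insertString(0, text, null);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    // Reads length characters starting at offset via a Segment.
    public static String readRange(Document doc, int offset, int length) {
        Segment segment = new Segment();
        try {
            doc.getText(offset, length, segment);
        } catch (BadLocationException e) {
            throw new IllegalArgumentException("Range is outside the document", e);
        }
        // The requested characters are array[offset .. offset + count - 1].
        return new String(segment.array, segment.offset, segment.count);
    }

    public static void main(String[] args) {
        Document doc = newDocument("hello world");
        System.out.println(readRange(doc, 6, 5));
    }
}
```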


startPos

protected int startPos
Starting position in document.


endPos

protected int endPos
Ending position in document.


currentPos

protected int currentPos
Current position in document.


WORD_SEPARATOR_CHARACTERS

public static final char[] WORD_SEPARATOR_CHARACTERS
Characters that separate words.

The single quote is not included as a word separator so that contractions can be picked up. It is up to the invoker to remove unwanted single quotes from a token. Likewise, a "-" is not considered a separator, so hyphenated words are returned as single tokens.
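
Because single quotes stay inside tokens, a caller may want to trim quotes that surround a word while keeping the apostrophe inside a contraction. The following is a minimal sketch of one way to do that. The class QuoteStripSketch and the method stripOuterQuotes are hypothetical and are not part of DocumentTokenizer.

```java
// Hypothetical caller-side helper; not part of DocumentTokenizer.
public class QuoteStripSketch {

    // Removes leading and trailing single quotes, keeping interior ones.
    public static String stripOuterQuotes(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == '\'') {
            start++;
        }
        while (end > start && token.charAt(end - 1) == '\'') {
            end--;
        }
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(stripOuterQuotes("'hello'"));
        System.out.println(stripOuterQuotes("don't"));
    }
}
```

Note that this simple rule also strips the trailing apostrophe of a plural possessive such as dogs', so a caller that cares about that case would need a more careful rule.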


WORD_SEPARATOR_CHARACTERS_STRING

public static final java.lang.String WORD_SEPARATOR_CHARACTERS_STRING
See Also:
Constant Field Values

separatorHashMap

protected static java.util.HashMap separatorHashMap
Hash holds separator characters for quick access.

Constructor Detail

DocumentTokenizer

public DocumentTokenizer(javax.swing.text.Document document,
                         int offset)
Create document tokenizer.

Parameters:
document - Document to tokenize.
offset - Offset in document to start at.
Method Detail

isSeparator

public static boolean isSeparator(char ch)
Checks if a character is a word separator.

Parameters:
ch - The character to check.
Returns:
True if the character is a word separator.

Tests whether a character is a separator by checking whether it is a key in the separatorHashMap map. If so, the character is a separator.


createSeparatorHashMap

protected static void createSeparatorHashMap()
Creates word separator hash map from list of separator characters.

The separatorHashMap map uses each separator character as both a key and the key's value.
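
The scheme described above, where each separator is both key and value and lookup is a key test, can be sketched as follows. This is an illustration under assumptions and not the class's source: SeparatorMapSketch, its short SEPARATORS list, and the use of generics (the documented field is a raw HashMap) are all choices made here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a separator lookup table.
public class SeparatorMapSketch {
    // Assumed, shortened separator list for demonstration only.
    static final char[] SEPARATORS = {' ', '\t', '\n', '.', ',', ';', '!', '?'};

    static final Map<Character, Character> separatorMap = new HashMap<>();

    static {
        createSeparatorMap();
    }

    // Stores each separator character as both the key and the value.
    static void createSeparatorMap() {
        for (char ch : SEPARATORS) {
            Character boxed = Character.valueOf(ch);
            separatorMap.put(boxed, boxed);
        }
    }

    // A character is a separator exactly when it is a key in the map.
    public static boolean isSeparator(char ch) {
        return separatorMap.containsKey(ch);
    }

    public static void main(String[] args) {
        System.out.println(isSeparator(','));
        System.out.println(isSeparator('\''));
    }
}
```

Since only key membership is tested, a java.util.HashSet would serve equally well in new code.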


moveToStartOfWord

public void moveToStartOfWord()
Move to start of next word if current cursor is in the middle of a word.
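
The summary above does not spell out the exact rule, so the following is only one plausible reading, sketched over a plain String: a position counts as mid-word when both the character before it and the character at it are non-separators, and in that case the position advances past the rest of the word and any separators that follow. The class WordStartSketch, its separator set, and the static method signature are hypothetical.

```java
// Hypothetical illustration; the real method works on the tokenizer's own state.
public class WordStartSketch {

    // Assumed separator set for demonstration only.
    static boolean isSeparator(char ch) {
        return " \t\r\n.,;:!?".indexOf(ch) >= 0;
    }

    // Returns pos unchanged unless it lies strictly inside a word. In that
    // case returns the index of the first character of the following word,
    // or text.length() if there is none.
    public static int moveToStartOfWord(String text, int pos) {
        boolean inMiddle = pos > 0 && pos < text.length()
                && !isSeparator(text.charAt(pos - 1))
                && !isSeparator(text.charAt(pos));
        if (!inMiddle) {
            return pos;
        }
        // Skip the remainder of the current word.
        while (pos < text.length() && !isSeparator(text.charAt(pos))) {
            pos++;
        }
        // Skip the separators between words.
        while (pos < text.length() && isSeparator(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(moveToStartOfWord("alpha beta gamma", 2));
    }
}
```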


hasNext

public boolean hasNext()
Check if more characters are available in the document.

Specified by:
hasNext in interface java.util.Iterator
Returns:
True if more characters remain in the document.

next

public java.lang.Object next()
Get next token in document.

Specified by:
next in interface java.util.Iterator
Returns:
Next token in document as a string.

remove

public void remove()
Removes last element returned by iterator (does nothing).

Specified by:
remove in interface java.util.Iterator

getStartPos

public int getStartPos()
Get starting position in document for tokenization.

Returns:
The starting position.

getEndPos

public int getEndPos()
Get ending position in document for tokenization.

Returns:
The ending position.

setPosition

public void setPosition(int pos)
Set position in document for tokenization.

Parameters:
pos - The position.