DocumentTokenizer (WordHoard)

edu.northwestern.at.utils.swing
Class DocumentTokenizer

java.lang.Object
  edu.northwestern.at.utils.swing.DocumentTokenizer

All Implemented Interfaces: java.util.Iterator

public class DocumentTokenizer
extends java.lang.Object
implements java.util.Iterator

Tokenizes document text.

A token is defined as text between word separator characters. The separator characters are defined below in the WORD_SEPARATOR_CHARACTERS array. The tokenizer keeps track of the starting and ending position of each token, which callers need in order to support features such as find/replace and spell checking.
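The idea of splitting on separators while recording each token's offsets can be sketched on a plain String. This is an illustration under stated assumptions, not the WordHoard implementation: the class TokenSpanSketch, its Span type, and the SEPARATORS string are hypothetical, and the end offset is treated as exclusive here, which may differ from what getEndPos() reports.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of WordHoard. It works on a String
// rather than a javax.swing.text.Document.
class TokenSpanSketch {
    // Assumed separator set for illustration only. The real list is
    // DocumentTokenizer.WORD_SEPARATOR_CHARACTERS. As in the source,
    // the single quote and "-" are deliberately left out.
    static final String SEPARATORS = " \t\n\r.,;:!?()[]{}\"";

    // One token plus its offsets (end is exclusive in this sketch).
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Collects every token found at or after the given offset.
    static List<Span> tokenize(String s, int offset) {
        List<Span> spans = new ArrayList<>();
        int pos = offset;
        while (pos < s.length()) {
            // Skip any run of separator characters.
            while (pos < s.length() && SEPARATORS.indexOf(s.charAt(pos)) >= 0) {
                pos++;
            }
            int start = pos;
            // Consume characters until the next separator or end of text.
            while (pos < s.length() && SEPARATORS.indexOf(s.charAt(pos)) < 0) {
                pos++;
            }
            if (pos > start) {
                spans.add(new Span(s.substring(start, pos), start, pos));
            }
        }
        return spans;
    }
}
```

With this assumed separator set, the text "Don't stop, well-known!" tokenized from offset 0 gives "Don't" at 0 to 5, "stop" at 6 to 10, and "well-known" at 12 to 22. A find/replace feature could use such offsets to select or replace the matching range in the document.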

DocumentTokenizer implements the Iterator interface, but the optional remove() method is left as a no-op since there is no collection underlying this class.
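The no-op remove() pattern described above can be illustrated with a small stand-alone iterator. The class NoOpRemoveIteratorSketch is hypothetical and splits on whitespace only; it shows the pattern, not how DocumentTokenizer itself is written.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical example, not part of WordHoard: an iterator with no
// backing collection, so remove() has nothing to delete.
class NoOpRemoveIteratorSketch implements Iterator<String> {
    private final String[] words;
    private int index = 0;

    NoOpRemoveIteratorSketch(String text) {
        String trimmed = text.trim();
        // Splitting on whitespace is an assumption made for brevity.
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean hasNext() {
        return index < words.length;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return words[index++];
    }

    // Deliberately does nothing, and iteration continues unaffected.
    @Override
    public void remove() {
    }
}
```

The Iterator contract also allows remove() to throw UnsupportedOperationException. A silent no-op, as chosen here, lets generic code call remove() without failing, at the cost of hiding the fact that nothing was removed.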

Example:

    // Tokenize a document and print out a list of words.
    // Start at position 0 (the beginning of the document).

    // Get the document from a JTextPane.
    Document document = textPane.getDocument();

    DocumentTokenizer tokenizer = new DocumentTokenizer( document , 0 );

    // While there are more characters we haven't looked at ...
    while ( tokenizer.hasNext() )
    {
        // Extract next word in document. next() is declared to
        // return java.lang.Object, so the result is cast here on
        // the assumption that each token is a String.
        String word = (String) tokenizer.next();

        // Print out word and its starting and ending
        // positions in the document text.
        System.out.println(
            word + " starts at " + tokenizer.getStartPos() +
            ", ends at " + tokenizer.getEndPos() );
    }

Field Summary
`protected int`	`currentPos` Current position in document.
`protected javax.swing.text.Document`	`document` The document to tokenize.
`protected int`	`endPos` Ending position in document.
`protected javax.swing.text.Segment`	`segment` The current document segment.
`protected static java.util.HashMap`	`separatorHashMap` Hash holds separator characters for quick access.
`protected int`	`startPos` Starting position in document.
`static char[]`	`WORD_SEPARATOR_CHARACTERS` Characters that separate words.
`static java.lang.String`	`WORD_SEPARATOR_CHARACTERS_STRING`

Constructor Summary
`DocumentTokenizer(javax.swing.text.Document document, int offset)` Create document tokenizer.

Method Summary
`protected static void`	`createSeparatorHashMap()` Creates word separator hash map from list of separator characters.
`int`	`getEndPos()` Get ending position in document for tokenization.
`int`	`getStartPos()` Get starting position in document for tokenization.
`boolean`	`hasNext()` Check if more characters available in document.
`static boolean`	`isSeparator(char ch)` Checks if a character is a word separator.
`void`	`moveToStartOfWord()` Move to start of next word if current cursor is in the middle of a word.
`java.lang.Object`	`next()` Get next token in document.
`void`	`remove()` Removes last element returned by iterator (does nothing).
`void`	`setPosition(int pos)` Set position in document for tokenization.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

document

protected javax.swing.text.Document document

The document to tokenize.

segment

protected javax.swing.text.Segment segment

The current document segment.

startPos

protected int startPos

Starting position in document.

endPos

protected int endPos

Ending position in document.

currentPos

protected int currentPos

Current position in document.

WORD_SEPARATOR_CHARACTERS

public static final char[] WORD_SEPARATOR_CHARACTERS

Characters that separate words.

The single quote is not included as a word separator so that contractions are picked up as single tokens. It is up to the invoker to remove unwanted single quotes from a token. Likewise, "-" is not considered a separator, so that words containing a dash are kept whole.
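Because single quotes stay inside tokens, a caller may want to strip quotes that merely wrap a word while keeping apostrophes inside it. The following is a minimal sketch of one way to do that. The class QuoteTrimSketch and the method stripOuterSingleQuotes are hypothetical names, not part of WordHoard.

```java
// Hypothetical caller-side helper, not part of WordHoard.
class QuoteTrimSketch {
    // Removes leading and trailing single quotes but keeps any
    // apostrophe in the interior of the token (as in "don't").
    static String stripOuterSingleQuotes(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == '\'') {
            start++;
        }
        while (end > start && token.charAt(end - 1) == '\'') {
            end--;
        }
        return token.substring(start, end);
    }
}
```

Note that this simple rule also drops a trailing apostrophe that marks a plural possessive (for example "dogs'"), so a caller that cares about possessives would need a more careful rule.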

WORD_SEPARATOR_CHARACTERS_STRING

public static final java.lang.String WORD_SEPARATOR_CHARACTERS_STRING

See Also: Constant Field Values

separatorHashMap

protected static java.util.HashMap separatorHashMap

Hash holds separator characters for quick access.
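A hash-based lookup of this kind can be sketched as follows. The sketch uses a HashSet of Character rather than the HashMap named above, and the class SeparatorLookupSketch and its short SEPARATORS array are assumptions made for illustration; the actual contents of WORD_SEPARATOR_CHARACTERS are not reproduced here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not the WordHoard implementation.
class SeparatorLookupSketch {
    // Assumed sample of separators. The single quote and "-" are left
    // out, matching the behavior described for the real array.
    static final char[] SEPARATORS =
        { ' ', '\t', '\n', '.', ',', ';', ':', '!', '?' };

    // Built once so each membership test is a constant-time hash
    // lookup instead of a scan over the array.
    static final Set<Character> SEPARATOR_SET = new HashSet<>();

    static {
        for (char ch : SEPARATORS) {
            SEPARATOR_SET.add(ch);
        }
    }

    // Returns true if ch is one of the assumed separators.
    static boolean isSeparator(char ch) {
        return SEPARATOR_SET.contains(ch);
    }
}
```

Building the set in a static initializer mirrors the role the summary gives to createSeparatorHashMap(): the lookup structure is derived from the character list rather than maintained by hand.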

Constructor Detail