FileTokenizer (WordHoard)

edu.northwestern.at.utils.swing
Class FileTokenizer

java.lang.Object
  edu.northwestern.at.utils.swing.FileTokenizer

All Implemented Interfaces: java.util.Iterator

public class FileTokenizer
extends java.lang.Object
implements java.util.Iterator

Tokenizes text from a text file.

A token is defined as text between word separator characters. The separator characters are defined below in the WORD_SEPARATOR_CHARACTERS array. The tokenizer keeps track of the starting and ending position of each token. This is necessary to support find/replace, spell checking, etc.

FileTokenizer implements the Iterator interface, but the optional remove() method is left as a no-op since there is no collection underlying this class.

Example:

    // Tokenize file text and print out list of words.
    // The constructor can throw IOException and BadLocationException.

    FileTokenizer tokenizer = new FileTokenizer( "myfile.txt" );

    // While there are more characters
    // we haven't looked at ...

    while ( tokenizer.hasNext() )
    {
        // Extract next word in document.
        // next() returns Object, so a cast is needed.

        String word = (String)tokenizer.next();

        // Print out word and its starting and
        // ending positions in the document text.

        System.out.println(
            word + " starts at " + tokenizer.getStartPos() +
            ", ends at " + tokenizer.getEndPos() );
    }
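FileTokenizer itself needs the WordHoard library and a file on disk, so the following is only a standalone sketch of the same idea: an Iterator that splits text on separator characters and remembers where each token starts and ends. The class name SketchTokenizer, the separator set, the exclusive end position, and the choice to have hasNext() skip trailing separators are all assumptions made for illustration, not the behavior of the actual class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Standalone sketch: tokenizes a String rather than a file-backed Document.
class SketchTokenizer implements Iterator<String> {
    // Assumed separator set; the real WORD_SEPARATOR_CHARACTERS may differ.
    // Single quote and dash are left out, as the class description suggests.
    private static final String SEPARATORS = " \t\r\n.,;:!?()\"";

    private final String text;
    private int currentPos = 0;
    private int startPos = 0;
    private int endPos = 0;

    SketchTokenizer(String text) {
        this.text = text;
    }

    private static boolean isSeparator(char ch) {
        return SEPARATORS.indexOf(ch) >= 0;
    }

    // Advance past separators so hasNext() is true only if a token remains.
    private void skipSeparators() {
        while (currentPos < text.length() && isSeparator(text.charAt(currentPos))) {
            currentPos++;
        }
    }

    @Override
    public boolean hasNext() {
        skipSeparators();
        return currentPos < text.length();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        startPos = currentPos;
        while (currentPos < text.length() && !isSeparator(text.charAt(currentPos))) {
            currentPos++;
        }
        endPos = currentPos; // exclusive end: an assumption of this sketch
        return text.substring(startPos, endPos);
    }

    @Override
    public void remove() {
        // No underlying collection, so this does nothing.
    }

    int getStartPos() { return startPos; }

    int getEndPos() { return endPos; }

    public static void main(String[] args) {
        SketchTokenizer t = new SketchTokenizer("Don't panic, well-read friend.");
        while (t.hasNext()) {
            String word = t.next();
            System.out.println(word + " starts at " + t.getStartPos()
                + ", ends at " + t.getEndPos());
        }
    }
}
```

Because the sketch is typed as Iterator<String>, no cast is needed on next(); the documented class implements the raw Iterator and returns Object.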

Field Summary
`protected int`	`currentPos` Current position in document.
`protected javax.swing.text.Document`	`document` The document to tokenize.
`protected int`	`endPos` Ending position in document.
`protected javax.swing.text.Segment`	`segment` The current document segment.
`protected static java.util.HashMap`	`separatorHashMap` Hash holds separator characters for quick access.
`protected int`	`startPos` Starting position in document.
`static char[]`	`WORD_SEPARATOR_CHARACTERS` Characters that separate words.
`static java.lang.String`	`WORD_SEPARATOR_CHARACTERS_STRING`

Constructor Summary
`FileTokenizer(java.lang.String textFileName)` Create document tokenizer.

Method Summary
`protected static void`	`createSeparatorHashMap()` Creates word separator hash map from list of separator characters.
`int`	`getEndPos()` Get ending position in document for tokenization.
`int`	`getStartPos()` Get starting position in document for tokenization.
`boolean`	`hasNext()` Check if more characters available in document.
`static boolean`	`isSeparator(char ch)` Checks if a character is a word separator.
`void`	`moveToStartOfWord()` Move to start of next word if current cursor is in the middle of a word.
`java.lang.Object`	`next()` Get next token in document.
`void`	`remove()` Removes last element returned by iterator (does nothing).
`void`	`setPosition(int pos)` Set position in document for tokenization.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

document

protected javax.swing.text.Document document

The document to tokenize.

segment

protected javax.swing.text.Segment segment

The current document segment.

startPos

protected int startPos

Starting position in document.

endPos

protected int endPos

Ending position in document.

currentPos

protected int currentPos

Current position in document.

WORD_SEPARATOR_CHARACTERS

public static final char[] WORD_SEPARATOR_CHARACTERS

Characters that separate words.

The single quote is not included as a word separator so that contractions can be picked up. It is up to the invoker to remove unwanted single quotes from a token. Likewise a "-" is not considered a separator so that words containing a dash can be worked with.
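Since removing unwanted single quotes is left to the invoker, one possible cleanup step is sketched below. The helper name stripOuterQuotes is hypothetical and not part of FileTokenizer; it trims quotes only from the ends of a token so that the apostrophe inside a contraction survives.

```java
class QuoteTrimmer {
    // Hypothetical helper: drops leading and trailing single quotes
    // but keeps interior ones, so "'don't'" becomes "don't".
    static String stripOuterQuotes(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == '\'') {
            start++;
        }
        while (end > start && token.charAt(end - 1) == '\'') {
            end--;
        }
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(stripOuterQuotes("'don't'"));
        System.out.println(stripOuterQuotes("well-read"));
    }
}
```

A trade-off of this approach is that it also strips meaningful edge apostrophes, as in a plural possessive such as "dogs'", so an application that cares about those would need a more careful rule.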

WORD_SEPARATOR_CHARACTERS_STRING

public static final java.lang.String WORD_SEPARATOR_CHARACTERS_STRING

See Also: Constant Field Values

separatorHashMap

protected static java.util.HashMap separatorHashMap

Hash holds separator characters for quick access.

Constructor Detail

FileTokenizer

public FileTokenizer(java.lang.String textFileName)
              throws java.io.IOException,
                     javax.swing.text.BadLocationException

Create document tokenizer.

Parameters: textFileName - Name of text file to tokenize.
Throws: java.io.IOException; javax.swing.text.BadLocationException

Method Detail

isSeparator

public static boolean isSeparator(char ch)

Checks if a character is a word separator.

Parameters: ch - The character to check.
Returns: True if the character is a word separator.

Tests whether a character is a separator by checking whether it is a key in the separatorHashMap map. If so, the character is a separator.

createSeparatorHashMap

protected static void createSeparatorHashMap()

Creates word separator hash map from list of separator characters.

The separatorHashMap map uses each separator character as both a key and the key's value.
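The lookup scheme described for isSeparator and createSeparatorHashMap might look roughly like the sketch below. The class name SeparatorLookup, the generic typing, and the contents of the SEPARATORS array are assumptions; the actual separator list is not given on this page.

```java
import java.util.HashMap;
import java.util.Map;

class SeparatorLookup {
    // Assumed separator list, for illustration only.
    static final char[] SEPARATORS = { ' ', '\t', '\n', '.', ',', ';', '!', '?' };

    // Each separator is stored as both the key and the key's value.
    static final Map<Character, Character> separatorMap = new HashMap<>();

    static {
        createSeparatorMap();
    }

    // Fills the map from the separator array.
    static void createSeparatorMap() {
        separatorMap.clear();
        for (char ch : SEPARATORS) {
            separatorMap.put(ch, ch);
        }
    }

    // A character is a separator exactly when it is a key in the map.
    static boolean isSeparator(char ch) {
        return separatorMap.containsKey(ch);
    }

    public static void main(String[] args) {
        System.out.println(isSeparator(','));
        System.out.println(isSeparator('\''));
        System.out.println(isSeparator('-'));
    }
}
```

Since only key membership is ever tested, a HashSet of Character would serve the same purpose; the map form is kept here to mirror the description above.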

moveToStartOfWord

public void moveToStartOfWord()

Move to start of next word if current cursor is in the middle of a word.
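The one-line description leaves the exact behavior open. One plausible reading, sketched below over a plain String, is that a cursor sitting inside a word is advanced past the rest of that word and any following separators, while a cursor already at a word start or on a separator is left alone. The class name, the static signature, and the separator set are assumptions of this sketch.

```java
class WordStart {
    // Assumed separator set, for illustration only.
    static boolean isSeparator(char ch) {
        return " \t\r\n.,;:!?".indexOf(ch) >= 0;
    }

    // Returns the position of the next word start if pos is mid-word,
    // otherwise returns pos unchanged.
    static int moveToStartOfWord(String text, int pos) {
        boolean midWord = pos > 0 && pos < text.length()
            && !isSeparator(text.charAt(pos - 1))
            && !isSeparator(text.charAt(pos));
        if (!midWord) {
            return pos;
        }
        // Skip the remainder of the current word.
        while (pos < text.length() && !isSeparator(text.charAt(pos))) {
            pos++;
        }
        // Skip the separators that follow it.
        while (pos < text.length() && isSeparator(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "hello world again";
        System.out.println(moveToStartOfWord(text, 2));
        System.out.println(moveToStartOfWord(text, 6));
    }
}
```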

hasNext

public boolean hasNext()

Check if more characters available in document.

Specified by: hasNext in interface java.util.Iterator

Returns: True if more characters in document.

next

public java.lang.Object next()

Get next token in document.

Specified by: next in interface java.util.Iterator

Returns: Next token in document as a string.

remove

public void remove()

Removes last element returned by iterator (does nothing).

Specified by: remove in interface java.util.Iterator

getStartPos

public int getStartPos()

Get starting position in document for tokenization.

Returns: The starting position.

getEndPos

public int getEndPos()

Get ending position in document for tokenization.

Returns: The ending position.

setPosition

public void setPosition(int pos)

Set position in document for tokenization.

Parameters: pos - The position.
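The class description mentions find/replace as a reason for tracking token positions. As a rough illustration of how a start position, an end position, and a new scan position could fit together, the sketch below replaces a range in a StringBuilder and computes where scanning might resume. The helper replaceToken is hypothetical, and treating the end position as exclusive is an assumption; this page does not say whether getEndPos() is inclusive.

```java
class ReplaceByPosition {
    // Replaces text[startPos, endPos) with the replacement and returns the
    // offset just past it, a candidate value for a later setPosition-style call.
    static int replaceToken(StringBuilder text, int startPos, int endPos,
                            String replacement) {
        text.replace(startPos, endPos, replacement);
        return startPos + replacement.length();
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder("teh quick fox");
        int resumeAt = replaceToken(text, 0, 3, "the");
        System.out.println(text + " (resume at " + resumeAt + ")");
    }
}
```

Resuming just past the replacement avoids rescanning the inserted text and accounts for the replacement being longer or shorter than the original token.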