edu.northwestern.at.utils.corpuslinguistics
Class WordCountExtractor

java.lang.Object
  extended by edu.northwestern.at.utils.corpuslinguistics.WordCountExtractor

public class WordCountExtractor
extends java.lang.Object

Counts words in a text.


Field Summary
protected  java.lang.String[] uniqueWords
          String array of unique words.
protected  java.util.TreeMap wordCounts
          Map from each word to its count in the text.
(package private)  java.lang.String[] words
          The text parsed into a string array of words.
 
Constructor Summary
WordCountExtractor(java.util.ArrayList wordList)
          Extract word counts from an ArrayList of words.
WordCountExtractor(java.lang.String[] words)
          Extract word counts from a string array of words.
WordCountExtractor(java.lang.String fileName, java.lang.String encoding)
          Extract word counts from a text file.
 
Method Summary
protected  void generateWordCountExtractor()
          Compute word counts from a string array of words.
 int getNumberOfUniqueWords()
          Return the number of unique words.
 int getNumberOfWords()
          Return the total number of words.
 java.lang.String[] getUniqueWords()
          Return unique words as a string array.
 int getWordCount(java.lang.String word)
          Return count for a specific word.
 java.util.Map getWordCounts()
          Return word count map.
 java.lang.String[] getWords()
          Return tokenized text words as a string array.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordCounts

protected java.util.TreeMap wordCounts
Map from each word to its count in the text.

Key=word
Value=Integer(count)
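
The Key=word, Value=Integer(count) layout can be illustrated with a plain java.util.TreeMap. The following is a minimal sketch and not the library's own code: the class name WordCountSketch and the method countWords are hypothetical names introduced only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in; not part of edu.northwestern.at.utils.corpuslinguistics.
final class WordCountSketch {

    // Build a sorted map with Key=word and Value=Integer(count).
    static TreeMap<String, Integer> countWords(String[] words) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            // Insert 1 for a new word, otherwise add 1 to the stored count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"the", "cat", "saw", "the", "dog"};
        // TreeMap iterates its keys in sorted order.
        for (Map.Entry<String, Integer> entry : countWords(words).entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}
```

Because a TreeMap keeps its keys sorted, iterating over the map visits the words in their natural String order.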


words

java.lang.String[] words
The text parsed into a string array of words. Package scope for the benefit of NGramExtractor.


uniqueWords

protected java.lang.String[] uniqueWords
String array of unique words.

Constructor Detail

WordCountExtractor

public WordCountExtractor(java.lang.String[] words)
Extract word counts from a string array of words.

Parameters:
words - The string array with the words.

WordCountExtractor

public WordCountExtractor(java.util.ArrayList wordList)
Extract word counts from an ArrayList of words.

Parameters:
wordList - The ArrayList with the words.

WordCountExtractor

public WordCountExtractor(java.lang.String fileName,
                          java.lang.String encoding)
Extract word counts from a text file.

Parameters:
fileName - The file containing the text to analyze.
encoding - The encoding for the text file (e.g., "utf-8").
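
How a file name and an encoding name might be turned into a word array can be sketched as follows. This is an illustration under stated assumptions, not the constructor's actual behavior: the class FileWordsSketch and the method readWords are hypothetical, and splitting on whitespace is an assumption, since this page does not describe how the text is tokenized.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in; not part of edu.northwestern.at.utils.corpuslinguistics.
final class FileWordsSketch {

    // Read the whole file with the named encoding and split it into words.
    // Assumption: words are separated by whitespace; the real class may tokenize differently.
    static String[] readWords(String fileName, String encoding) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        String text = new String(bytes, Charset.forName(encoding)).trim();
        // An empty file yields an empty array rather than one empty word.
        return text.isEmpty() ? new String[0] : text.split("\\s+");
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: FileWordsSketch <fileName> <encoding>");
            return;
        }
        System.out.println(readWords(args[0], args[1]).length + " words");
    }
}
```

Charset.forName accepts names such as "utf-8" case-insensitively and throws an exception for names it does not recognize or support.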

Method Detail

generateWordCountExtractor

protected void generateWordCountExtractor()
Compute word counts from a string array of words.


getWords

public java.lang.String[] getWords()
Return tokenized text words as a string array.

Returns:
The string array of words.

getNumberOfWords

public int getNumberOfWords()
Return the total number of words.

Returns:
The number of words.

getUniqueWords

public java.lang.String[] getUniqueWords()
Return unique words as a string array.

Returns:
The string array of unique words.

getNumberOfUniqueWords

public int getNumberOfUniqueWords()
Return the number of unique words.

Returns:
The number of unique words.

getWordCount

public int getWordCount(java.lang.String word)
Return count for a specific word.

Parameters:
word - The word whose count is desired.
Returns:
The count of the word in the text.
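
A lookup of this kind can be sketched against a plain word count map. This is a hedged illustration, not the library's implementation: the class WordLookupSketch is hypothetical, and returning 0 for a word that is not in the map is an assumption, since this page does not say what the real method returns for an absent word.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in; not part of edu.northwestern.at.utils.corpuslinguistics.
final class WordLookupSketch {

    // Return the stored count for a word.
    // Assumption: an absent word is reported as 0; the real class may behave differently.
    static int getWordCount(Map<String, Integer> wordCounts, String word) {
        Integer count = wordCounts.get(word);
        return (count == null) ? 0 : count;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCounts = new TreeMap<>();
        wordCounts.put("cat", 1);
        wordCounts.put("the", 2);
        System.out.println(getWordCount(wordCounts, "the"));
        System.out.println(getWordCount(wordCounts, "zebra"));
    }
}
```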

getWordCounts

public java.util.Map getWordCounts()
Return word count map.

Returns:
The word count map, keyed by word with Integer counts as values.