edu.northwestern.at.utils.corpuslinguistics
Class NGramExtractor

java.lang.Object
  extended by edu.northwestern.at.utils.corpuslinguistics.NGramExtractor

public class NGramExtractor
extends java.lang.Object

Extract ngrams from text.


Field Summary
protected  java.util.TreeMap nGramCounts
          The list of ngrams and associated counts.
(package private)  int nGramSize
          Number of words forming an ngram.
protected  int numberOfNGrams
          Total number of ngrams.
(package private)  int windowSize
          Window size within which to search for ngrams.
protected  WordCountExtractor wordCountExtractor
          The WordCountExtractor with the list of words to analyze.
 
Constructor Summary
NGramExtractor(java.util.ArrayList wordList, int nGramSize, int windowSize)
          Create NGram analysis from an arraylist of words.
NGramExtractor(java.lang.String[] words, int nGramSize, int windowSize)
          Create NGram analysis from string array of words.
NGramExtractor(java.lang.String fileName, java.lang.String encoding, int nGramSize, int windowSize)
          Create NGram analysis of a text file.
NGramExtractor(WordCountExtractor wordCountExtractor, int nGramSize, int windowSize)
          Create NGram analysis from a WordCountExtractor.
 
Method Summary
protected  void generateNGrams()
          Generate NGram analysis from string array of words.
 int getNGramCount(java.lang.String ngram)
          Return count for a specific ngram.
 java.util.SortedMap getNGramMap()
          Return NGram map.
 java.lang.String[] getNGrams()
          Return NGrams.
 int getNumberOfNGrams()
          Returns the total number of ngrams.
 int getNumberOfUniqueNGrams()
          Returns the number of unique ngrams.
 void mergeNGramExtractor(NGramExtractor extractor)
          Merge ngrams from another NGramExtractor.
static java.lang.String[] splitNGramIntoWords(java.lang.String ngram)
          Returns the individual words comprising an ngram.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordCountExtractor

protected WordCountExtractor wordCountExtractor
The WordCountExtractor with the list of words to analyze.


nGramSize

int nGramSize
Number of words forming an ngram.


windowSize

int windowSize
Window size within which to search for ngrams.


nGramCounts

protected java.util.TreeMap nGramCounts
The list of ngrams and associated counts.

Key=ngram string
Value=Integer(count)

The ngram string is two or more words with a tab character ("\t") separating the words.
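
The key format described above can be illustrated with a small sketch. It is not the class's own code: the class name NGramCountMapSketch and the helpers toKey and addOccurrence are hypothetical names chosen for illustration, and the use of generics is an assumption (the documented field is a raw TreeMap).

```java
import java.util.TreeMap;

// Hypothetical illustration of a tab-separated ngram key and its count map.
class NGramCountMapSketch {
    /** Build a map key by joining the ngram's words with a tab character. */
    static String toKey(String... words) {
        return String.join("\t", words);
    }

    /** Record one more occurrence of the ngram formed by the given words. */
    static void addOccurrence(TreeMap<String, Integer> counts, String... words) {
        // merge() stores 1 for a new key, otherwise adds 1 to the existing count.
        counts.merge(toKey(words), 1, Integer::sum);
    }
}
```

Because a TreeMap keeps its keys sorted, iterating over such a map visits the ngrams in the natural (lexicographic) order of their tab-joined strings.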


numberOfNGrams

protected int numberOfNGrams
Total number of ngrams.

Constructor Detail

NGramExtractor

public NGramExtractor(java.lang.String[] words,
                      int nGramSize,
                      int windowSize)
Create NGram analysis from string array of words.

Parameters:
words - The string array with the words.
nGramSize - The number of words forming an ngram.
windowSize - The window size (number of words) within which to construct ngrams.
  • windowSize must be greater than or equal to nGramSize.
  • If windowSize is the same as nGramSize, all ngrams consist of adjacent words.
  • If windowSize is greater than nGramSize, every set of nGramSize words, adjacent or not, is extracted in text order from each set of windowSize words.

Example: nGramSize=2, windowSize=3, text="a quick brown fox".

The first window is "a quick brown". The ngrams are "a quick", "a brown", and "quick brown".

The second window is "quick brown fox". The ngrams are "quick brown", "quick fox", and "brown fox".
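
The windowing rule in this example can be sketched as follows. This is a minimal illustration that follows the example literally, not the class's actual implementation: the class name WindowedNGramSketch and its methods are hypothetical, and it assumes the window slides forward one word at a time and that an ngram found in two overlapping windows is emitted once per window (the documentation does not say whether NGramExtractor counts such repeats).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of window-based ngram enumeration.
class WindowedNGramSketch {
    /** List the tab-joined ngrams of each window, window by window. */
    static List<String> windowedNGrams(String[] words, int nGramSize, int windowSize) {
        if (nGramSize < 1 || windowSize < nGramSize) {
            throw new IllegalArgumentException("windowSize must be >= nGramSize >= 1");
        }
        List<String> result = new ArrayList<>();
        // Assumption: the window advances one word at a time.
        for (int start = 0; start + windowSize <= words.length; start++) {
            combine(words, start, start + windowSize, nGramSize,
                    new ArrayList<>(), result);
        }
        return result;
    }

    /** Choose 'remaining' more words from words[from..end), keeping text order. */
    private static void combine(String[] words, int from, int end, int remaining,
                                List<String> picked, List<String> out) {
        if (remaining == 0) {
            out.add(String.join("\t", picked));
            return;
        }
        // Stop early enough that the remaining slots can still be filled.
        for (int i = from; i + remaining <= end; i++) {
            picked.add(words[i]);
            combine(words, i + 1, end, remaining - 1, picked, out);
            picked.remove(picked.size() - 1);
        }
    }
}
```

Called with the four words of "a quick brown fox", nGramSize=2, and windowSize=3, the sketch returns the six ngrams listed in the example, in the same order, with "quick brown" appearing once for each window.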


NGramExtractor

public NGramExtractor(java.util.ArrayList wordList,
                      int nGramSize,
                      int windowSize)
Create NGram analysis from an arraylist of words.

Parameters:
wordList - The arraylist with the words.
nGramSize - The number of words forming an ngram.
windowSize - The window size (number of words) within which to construct ngrams.

NGramExtractor

public NGramExtractor(java.lang.String fileName,
                      java.lang.String encoding,
                      int nGramSize,
                      int windowSize)
Create NGram analysis of a text file.

Parameters:
fileName - The file containing the text to analyze.
encoding - The encoding for the text file (e.g., "utf-8").
nGramSize - The number of words forming an ngram.
windowSize - The window size (number of words) within which to construct ngrams.

NGramExtractor

public NGramExtractor(WordCountExtractor wordCountExtractor,
                      int nGramSize,
                      int windowSize)
Create NGram analysis from a WordCountExtractor.

Parameters:
wordCountExtractor - The WordCountExtractor containing the words to analyze.
nGramSize - The number of words forming an ngram.
windowSize - The window size (number of words) within which to construct ngrams.
Method Detail

generateNGrams

protected void generateNGrams()
Generate NGram analysis from string array of words.


mergeNGramExtractor

public void mergeNGramExtractor(NGramExtractor extractor)
Merge ngrams from another NGramExtractor.

Parameters:
extractor - The NGramExtractor whose ngrams are merged into this one.
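
The documentation does not spell out how counts are combined during a merge. One plausible reading, sketched here under that assumption, is that the counts of ngrams present in both extractors are summed and the total is increased by the number of occurrences merged in. The class NGramMergeSketch and the method mergeCounts are hypothetical and operate on plain maps rather than on NGramExtractor itself.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: merging one ngram count map into another.
class NGramMergeSketch {
    /**
     * Add every count in 'other' to 'target' (assumed behavior: sum counts).
     * Returns the number of ngram occurrences that were merged in.
     */
    static int mergeCounts(TreeMap<String, Integer> target, Map<String, Integer> other) {
        int added = 0;
        for (Map.Entry<String, Integer> entry : other.entrySet()) {
            // New keys are inserted as-is; existing keys have their counts summed.
            target.merge(entry.getKey(), entry.getValue(), Integer::sum);
            added += entry.getValue();
        }
        return added;
    }
}
```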

getNGramCount

public int getNGramCount(java.lang.String ngram)
Return count for a specific ngram.

Parameters:
ngram - The ngram whose count is desired.
Returns:
The count of the ngram in the text.

getNGrams

public java.lang.String[] getNGrams()
Return NGrams.

Returns:
String array of ngrams.

getNGramMap

public java.util.SortedMap getNGramMap()
Return NGram map.

Returns:
NGram map as a sorted map.

getNumberOfNGrams

public int getNumberOfNGrams()
Returns the total number of ngrams.

Returns:
The total number of ngrams.

getNumberOfUniqueNGrams

public int getNumberOfUniqueNGrams()
Returns the number of unique ngrams.

Returns:
The number of unique ngrams.

splitNGramIntoWords

public static java.lang.String[] splitNGramIntoWords(java.lang.String ngram)
Returns the individual words comprising an ngram.

Parameters:
ngram - The ngram to parse.
Returns:
String array of the individual words (in order) comprising the ngram.
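
Since the nGramCounts field stores each ngram as words separated by a tab character, splitting an ngram can be sketched as a split on that tab. This is an assumption about the behavior, not the library's source; SplitNGramSketch and splitIntoWords are hypothetical names.

```java
// Hypothetical sketch: recovering the words from a tab-separated ngram string.
class SplitNGramSketch {
    /** Split a tab-separated ngram into its words, preserving their order. */
    static String[] splitIntoWords(String ngram) {
        // A negative limit keeps trailing empty strings instead of discarding them.
        return ngram.split("\t", -1);
    }
}
```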