edu.northwestern.at.wordhoard.swing.calculator.analysis
Class FindMultiwordUnits

java.lang.Object
  extended by edu.northwestern.at.wordhoard.swing.calculator.analysis.FrequencyAnalysisRunnerBase
      extended by edu.northwestern.at.wordhoard.swing.calculator.analysis.FindMultiwordUnits
All Implemented Interfaces:
AnalysisRunner

public class FindMultiwordUnits
extends FrequencyAnalysisRunnerBase
implements AnalysisRunner

Find multiword units.


Field Summary
protected  int accepted
          Count of mwus accepted.
protected  int acceptedByLocalMaxs
          Count of mwus accepted by localmaxs algorithm.
protected static int DICECOLUMN
           
protected static int LOGLIKECOLUMN
           
protected static int MWUCOUNTCOLUMN
           
protected static int MWULENGTHCOLUMN
           
protected  int mwusToReportOn
          Count of mwus to report on.
protected static int MWUTEXTCOLUMN
          Output column indices.
protected  int onceOnly
          Count of mwus which occur only once.
protected static int PHISQUAREDCOLUMN
           
protected  int rejected
          Count of mwus rejected by filters.
protected  int rejectedByWordClassFilters
          Count of mwus rejected by word class filters.
protected static int SCPCOLUMN
           
protected static int SICOLUMN
           
protected  int sortColumn
          The column containing the association measure to use.
protected static int WORDCLASSESCOLUMN
           
 
Fields inherited from class edu.northwestern.at.wordhoard.swing.calculator.analysis.FrequencyAnalysisRunnerBase
adjustChiSquareForMultipleComparisons, analysisText, analysisTextBreakdownBy, analyzePhraseFrequencies, associationMeasure, blankReplacementCharacter, collocationOccurrenceMap, colorCodeOveruseColumn, compressValueRangeInTagClouds, contextButton, cutoff, displayProgress, filterBigramsByWordClass, filterMultiwordUnitsContainingVerbs, filterOutProperNames, filterSingleOccurrences, filterTrigramsByWordClass, filterUsingLocalMaxs, FONT_SIZE, frequencyAnalysisType, frequencyNormalizationMethod, FrequencyProfileResults, ignoreCaseAndDiacriticalMarks, leftSpan, markSignificantLogLikelihoodValues, maximumMultiwordUnitLength, minimumCount, minimumMultiwordUnitLength, minimumWorkCount, model, percentReportMethod, pluralWordFormString, progressReporter, referenceText, referenceTextBreakdownBy, resultsPanel, resultsScrollPane, resultsTable, rightSpan, roundNormalizedFrequencies, showPhraseFrequencies, showWordClasses, tableSelectionListener, useShortWorkTitlesInDialogs, useShortWorkTitlesInHeaders, useShortWorkTitlesInOutput, useShortWorkTitlesInWindowTitles, wordForm, wordFormString, wordOccs, wordToAnalyze
 
Constructor Summary
FindMultiwordUnits()
          Create a multiword unit analysis object.
 
Method Summary
 boolean areResultOptionsAvailable()
          Are result options available?
protected  java.util.Collection createRawMWUs(WordCountExtractor wordExtractor, NGramExtractor[] extractors)
          Create raw (unfiltered) multiword unit strings.
protected  java.lang.String[] extractLemmata(java.util.List workWords)
          Extract lemmata from retrieved data.
protected  java.lang.String[] extractSpellings(java.util.List workWords)
          Extract spellings from retrieved data.
protected  java.lang.String[] filterMultiwordUnits(java.util.List mwuCountData, java.util.HashMap glueMap, java.util.Map wordCountMap, NGramExtractor[] extractors, SortedTableModel model)
          Filter the raw multiword units.
protected  java.lang.String fixMWUText(java.lang.String mwuText)
          Fix multiword unit text for display.
protected  ResultsPanel generateResults(WordHoardSortedTableModel model, java.lang.String[] maxLabels, int sortColumn, int totalWordCount)
          Displays results of multiword unit extraction in a sorted table.
 ResultsPanel getCloud()
          Show tag cloud of Dunning's log-likelihood profile.
protected  double getGlue(java.lang.String mwuText, java.util.Map glueMap)
          Get "glue" value for a multiword unit.
 LabeledColumn getResultOptions()
          Return result options.
 boolean isCloudAvailable()
          Is cloud output available?
protected  boolean isMWU(MultiwordUnitData countData, java.util.Map glueMap)
          Determine if multiword unit is a phrase using localmaxs.
protected  boolean passesBigramFilter(java.lang.String[] wordClasses)
          Filter bigrams by word class.
protected  boolean passesTrigramFilter(java.lang.String[] wordClasses)
          Filter trigrams by word class.
protected  boolean passesVerbFilter(java.lang.String[] wordClasses)
          Filter ngrams containing verbs.
 boolean passesWordClassFilters(java.lang.String[] words)
          Filter multiword units using major word class.
protected  java.util.List retrieveLemmata(Work work)
          Perform query and get lemmata for selected work(s).
protected  java.util.List retrieveSpellings(Work work)
          Perform query and get spellings for selected work(s).
 void runAnalysis(javax.swing.JFrame parentWindow, ProgressReporter progressReporter)
          Run an analysis.
protected  java.lang.Object[] storeMWUData(java.util.Collection mwusList, java.util.Map wordCountMap, int totalWordCount, NGramExtractor[] extractors)
          Store multiword unit data.
 
Methods inherited from class edu.northwestern.at.wordhoard.swing.calculator.analysis.FrequencyAnalysisRunnerBase
closeProgressReporter, createCloudAssociationMeasuresComboBox, createCompressValueRangeInTagCloudsCheckBox, generateResults, getAnalysisPercentColumnName, getChart, getCloud, getColTitleWordFormString, getContext, getDoubleFormat, getPercentReportMethodFormat, getReferencePercentColumnName, getResults, getTableFontSize, getTitle, handleTableSelectionChange, isCancelled, isChartAvailable, isContextAvailable, isFilterAvailable, saveChart, setContextButton, showDialog
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface edu.northwestern.at.wordhoard.swing.calculator.analysis.AnalysisRunner
getChart, getContext, getResults, handleTableSelectionChange, isChartAvailable, isContextAvailable, isFilterAvailable, saveChart, setContextButton, showDialog
 

Field Detail

MWUTEXTCOLUMN

protected static final int MWUTEXTCOLUMN
Output column indices.

See Also:
Constant Field Values

WORDCLASSESCOLUMN

protected static final int WORDCLASSESCOLUMN
See Also:
Constant Field Values

MWULENGTHCOLUMN

protected static final int MWULENGTHCOLUMN
See Also:
Constant Field Values

MWUCOUNTCOLUMN

protected static final int MWUCOUNTCOLUMN
See Also:
Constant Field Values

DICECOLUMN

protected static final int DICECOLUMN
See Also:
Constant Field Values

LOGLIKECOLUMN

protected static final int LOGLIKECOLUMN
See Also:
Constant Field Values

PHISQUAREDCOLUMN

protected static final int PHISQUAREDCOLUMN
See Also:
Constant Field Values

SICOLUMN

protected static final int SICOLUMN
See Also:
Constant Field Values

SCPCOLUMN

protected static final int SCPCOLUMN
See Also:
Constant Field Values

accepted

protected int accepted
Count of mwus accepted.


rejected

protected int rejected
Count of mwus rejected by filters.


onceOnly

protected int onceOnly
Count of mwus which occur only once.


mwusToReportOn

protected int mwusToReportOn
Count of mwus to report on.


rejectedByWordClassFilters

protected int rejectedByWordClassFilters
Count of mwus rejected by word class filters.


acceptedByLocalMaxs

protected int acceptedByLocalMaxs
Count of mwus accepted by localmaxs algorithm.


sortColumn

protected int sortColumn
The column containing the association measure to use.
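
The column names above refer to standard association measures. As a rough illustration only, the sketch below computes two of them (Dice and symmetric conditional probability) for a bigram from raw counts, using the usual textbook definitions. The class and method names are hypothetical and not part of WordHoard, and this page does not state how WordHoard computes these values, particularly for units longer than two words.

```java
/** Hypothetical helper for illustration; not part of WordHoard. */
final class BigramAssociationSketch {

    /** Dice coefficient for a bigram "x y": 2 * f(xy) / (f(x) + f(y)). */
    static double dice(int countXY, int countX, int countY) {
        return (2.0 * countXY) / (countX + countY);
    }

    /**
     * Symmetric conditional probability: P(x|y) * P(y|x),
     * which reduces to f(xy)^2 / (f(x) * f(y)) when all
     * probabilities share the same denominator (an assumption here).
     */
    static double scp(int countXY, int countX, int countY) {
        return ((double) countXY * countXY) / ((double) countX * countY);
    }

    public static void main(String[] args) {
        System.out.println("Dice = " + dice(10, 20, 30));
        System.out.println("SCP  = " + scp(10, 20, 50));
    }
}
```

Both measures grow as the two words co-occur more often relative to their individual frequencies, which is why either can serve as the sort column.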

Constructor Detail

FindMultiwordUnits

public FindMultiwordUnits()
Create a multiword unit analysis object.

Method Detail

runAnalysis

public void runAnalysis(javax.swing.JFrame parentWindow,
                        ProgressReporter progressReporter)
Run an analysis.

Specified by:
runAnalysis in interface AnalysisRunner
Overrides:
runAnalysis in class FrequencyAnalysisRunnerBase
Parameters:
parentWindow - Parent window for dialogs in the analysis.
progressReporter - Progress display for analysis.

retrieveSpellings

protected java.util.List retrieveSpellings(Work work)
Perform query and get spellings for selected work(s).

Parameters:
work - Work from which to retrieve words.

retrieveLemmata

protected java.util.List retrieveLemmata(Work work)
Perform query and get lemmata for selected work(s).

Parameters:
work - Work from which to retrieve words.

extractSpellings

protected java.lang.String[] extractSpellings(java.util.List workWords)
Extract spellings from retrieved data.

Parameters:
workWords - Retrieved words.
Returns:
String array of spellings suitable for counting.

extractLemmata

protected java.lang.String[] extractLemmata(java.util.List workWords)
Extract lemmata from retrieved data.

Parameters:
workWords - Retrieved words.
Returns:
String array of lemmata suitable for counting.

createRawMWUs

protected java.util.Collection createRawMWUs(WordCountExtractor wordExtractor,
                                             NGramExtractor[] extractors)
Create raw (unfiltered) multiword unit strings.

Parameters:
extractors - The NGramExtractors to receive the raw multiword unit strings.
Returns:
List of all raw multiword units to analyze.
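
To make "raw (unfiltered) multiword unit strings" concrete, here is a minimal sketch of one way to enumerate every contiguous n-gram between a minimum and maximum length. It does not use WordCountExtractor or NGramExtractor. The class name, the method signature, and the single-space separator are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of WordHoard. */
final class RawNGramSketch {

    /** Returns all contiguous n-grams with minLength <= n <= maxLength. */
    static List<String> rawNGrams(String[] words, int minLength, int maxLength) {
        List<String> result = new ArrayList<>();
        for (int n = minLength; n <= maxLength; n++) {
            for (int start = 0; start + n <= words.length; start++) {
                // Join the words of this window with a space (assumed separator).
                String[] window = Arrays.copyOfRange(words, start, start + n);
                result.add(String.join(" ", window));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = { "the", "quick", "brown", "fox" };
        System.out.println(rawNGrams(words, 2, 3));
    }
}
```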

storeMWUData

protected java.lang.Object[] storeMWUData(java.util.Collection mwusList,
                                          java.util.Map wordCountMap,
                                          int totalWordCount,
                                          NGramExtractor[] extractors)
Store multiword unit data.

Parameters:
mwusList - Collection of all raw multiword units.
wordCountMap - Map containing words as keys and counts as values for all words in the multiword units.
totalWordCount - Total word count in word count map.
extractors - The NGramExtractors holding the counts for the raw multiword unit strings.
Returns:
Two-item array. [0] = list of all multiword unit count data items. [1] = hash map mapping each mwu to its selected association measure, for use by localmaxs.
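
Because the result is an untyped Object[], callers must cast each slot. The sketch below shows one way to pack and unpack a pair of that shape. The element types (String for the count data, Double for the association measure) and all names are stand-ins chosen for illustration, not the types WordHoard actually stores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of WordHoard. */
final class StoredMWUDataSketch {

    /** Builds an array in the documented shape: [0] = list, [1] = map. */
    static Object[] pack(List<String> countData, Map<String, Double> glueMap) {
        return new Object[] { countData, glueMap };
    }

    /** Reads slot [0]; the element type is an assumption. */
    @SuppressWarnings("unchecked")
    static List<String> countDataOf(Object[] stored) {
        return (List<String>) stored[0];
    }

    /** Reads slot [1]; the value type is an assumption. */
    @SuppressWarnings("unchecked")
    static Map<String, Double> glueMapOf(Object[] stored) {
        return (Map<String, Double>) stored[1];
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        data.add("new york");
        Map<String, Double> glue = new HashMap<>();
        glue.put("new york", 0.9);
        Object[] stored = pack(data, glue);
        System.out.println(countDataOf(stored) + " " + glueMapOf(stored));
    }
}
```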

filterMultiwordUnits

protected java.lang.String[] filterMultiwordUnits(java.util.List mwuCountData,
                                                  java.util.HashMap glueMap,
                                                  java.util.Map wordCountMap,
                                                  NGramExtractor[] extractors,
                                                  SortedTableModel model)
Filter the raw multiword units.

Parameters:
mwuCountData - The list of multiword unit count data.
glueMap - Hash map of mwus to glue association measures.
wordCountMap - Word count map.
extractors - Extractors holding mwu count data.
model - Table model in which to store filtered mwus.
Returns:
Longest mwu string in table.

fixMWUText

protected java.lang.String fixMWUText(java.lang.String mwuText)
Fix multiword unit text for display.

Parameters:
mwuText - The multiword unit text to fix.
Returns:
The multiword unit text suitable for display.

isMWU

protected boolean isMWU(MultiwordUnitData countData,
                        java.util.Map glueMap)
Determine if multiword unit is a phrase using localmaxs.

Parameters:
countData - The multiword unit data.
glueMap - The glue map for all multiword units.
Returns:
true if multiword unit appears to be a phrase.
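
For orientation, the following is a minimal sketch of the LocalMaxs criterion as it is commonly stated (Silva and Lopes): an n-gram is kept if its glue is strictly greater than the glue of every (n+1)-gram that contains it and, when n > 2, at least as large as the glue of each (n-1)-gram it contains. The class, the space-separated string keys, and the Double-valued map are assumptions. WordHoard's isMWU takes a MultiwordUnitData and may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of WordHoard. */
final class LocalMaxsSketch {

    /** Glue lookup that yields 0 for unknown n-grams. */
    static double glue(String ngram, Map<String, Double> glueMap) {
        return glueMap.getOrDefault(ngram, 0.0);
    }

    static boolean isLocalMax(String ngram, Map<String, Double> glueMap) {
        double g = glue(ngram, glueMap);
        int n = ngram.split(" ").length;
        if (n > 2) {
            // Compare against the two contiguous (n-1)-grams inside this n-gram.
            String withoutFirst = ngram.substring(ngram.indexOf(' ') + 1);
            String withoutLast = ngram.substring(0, ngram.lastIndexOf(' '));
            if (g < glue(withoutFirst, glueMap) || g < glue(withoutLast, glueMap)) {
                return false;
            }
        }
        // Compare against every known (n+1)-gram that extends this n-gram by one word.
        for (Map.Entry<String, Double> e : glueMap.entrySet()) {
            String other = e.getKey();
            boolean oneWordLonger = other.split(" ").length == n + 1;
            boolean extendsNgram = other.startsWith(ngram + " ") || other.endsWith(" " + ngram);
            if (oneWordLonger && extendsNgram && g <= e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Double> glueMap = new HashMap<>();
        glueMap.put("new york", 0.9);
        glueMap.put("york city", 0.4);
        glueMap.put("new york city", 0.5);
        System.out.println(isLocalMax("new york", glueMap));
    }
}
```

The linear scan over the map keeps the sketch short; a real implementation would more likely index n-grams by length or by their sub-units.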

getGlue

protected double getGlue(java.lang.String mwuText,
                         java.util.Map glueMap)
Get "glue" value for a multiword unit.

Parameters:
mwuText - The multiword unit text.
glueMap - The map from multiword units to glue values.
Returns:
The glue value for the given multiword unit. Returns 0 if mwu not found.
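
The documented fallback to 0 for a missing unit amounts to a defaulted map lookup. A minimal sketch follows, assuming the map stores Double values keyed by the mwu text. The class name and the generic types are illustrative, since the real signature uses a raw java.util.Map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of WordHoard. */
final class GlueLookupSketch {

    /** Returns the stored glue value, or 0 when the mwu has no entry. */
    static double getGlue(String mwuText, Map<String, Double> glueMap) {
        Double glue = glueMap.get(mwuText);
        return (glue == null) ? 0.0 : glue;
    }

    public static void main(String[] args) {
        Map<String, Double> glueMap = new HashMap<>();
        glueMap.put("new york", 0.9);
        System.out.println(getGlue("new york", glueMap));
        System.out.println(getGlue("old york", glueMap));
    }
}
```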

passesBigramFilter

protected boolean passesBigramFilter(java.lang.String[] wordClasses)
Filter bigrams by word class.

Parameters:
wordClasses - Major word classes for each word in bigram.

The bigram filters are those suggested by Justeson and Katz.

  • A N
  • N N

A = adjective
N = noun
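
The two accepted patterns can be checked directly. In the sketch below, the word class labels "adjective" and "noun" are assumed spellings, and the class is hypothetical. The actual major word class strings used by WordHoard are not given on this page.

```java
/** Hypothetical helper for illustration; not part of WordHoard. */
final class BigramFilterSketch {

    /** Accepts only the patterns A N and N N. */
    static boolean passesBigramFilter(String[] wordClasses) {
        if (wordClasses == null || wordClasses.length != 2) {
            return false;
        }
        boolean firstOk = "adjective".equals(wordClasses[0]) || "noun".equals(wordClasses[0]);
        return firstOk && "noun".equals(wordClasses[1]);
    }

    public static void main(String[] args) {
        System.out.println(passesBigramFilter(new String[] { "adjective", "noun" }));
        System.out.println(passesBigramFilter(new String[] { "noun", "verb" }));
    }
}
```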


passesTrigramFilter

protected boolean passesTrigramFilter(java.lang.String[] wordClasses)
Filter trigrams by word class.

Parameters:
wordClasses - Major word classes for words comprising trigram.

The trigram filters are those suggested by Justeson and Katz.

  • A A N
  • A N N
  • N A N
  • N N N
  • N P N

To these we add:

  • N C N

A = adjective
N = noun
P = preposition
C = conjunction
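
One compact way to test the six patterns above is to encode each word class as its one-letter code and look the resulting string up in a set. This is a sketch under assumptions: the word class labels ("adjective", "noun", "preposition", "conjunction") and the class itself are hypothetical, and entries in the input array are assumed to be non-null.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of WordHoard. */
final class TrigramFilterSketch {

    /** The five Justeson and Katz patterns plus the added N C N. */
    private static final Set<String> ACCEPTED = new HashSet<>(
        Arrays.asList("AAN", "ANN", "NAN", "NNN", "NPN", "NCN"));

    /** Maps an assumed word class label to its one-letter code. */
    static char code(String wordClass) {
        switch (wordClass) {
            case "adjective":   return 'A';
            case "noun":        return 'N';
            case "preposition": return 'P';
            case "conjunction": return 'C';
            default:            return '?';
        }
    }

    static boolean passesTrigramFilter(String[] wordClasses) {
        if (wordClasses == null || wordClasses.length != 3) {
            return false;
        }
        StringBuilder pattern = new StringBuilder();
        for (String wordClass : wordClasses) {
            pattern.append(code(wordClass));
        }
        return ACCEPTED.contains(pattern.toString());
    }

    public static void main(String[] args) {
        System.out.println(passesTrigramFilter(new String[] { "noun", "preposition", "noun" }));
    }
}
```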


passesVerbFilter

protected boolean passesVerbFilter(java.lang.String[] wordClasses)
Filter ngrams containing verbs.

Parameters:
wordClasses - Major word classes for each word in ngram.

The ngram is filtered if any of the constituent words is a verb.
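
A minimal sketch of this rule follows, assuming that "passes" means the ngram is kept (the method returns true) when no word is a verb, and that verbs carry the label "verb". Both assumptions, and the class name, are illustrative only.

```java
/** Hypothetical helper for illustration; not part of WordHoard. */
final class VerbFilterSketch {

    /** Returns false as soon as any word class is a verb. */
    static boolean passesVerbFilter(String[] wordClasses) {
        for (String wordClass : wordClasses) {
            if ("verb".equals(wordClass)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(passesVerbFilter(new String[] { "noun", "verb", "noun" }));
        System.out.println(passesVerbFilter(new String[] { "adjective", "noun" }));
    }
}
```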


passesWordClassFilters

public boolean passesWordClassFilters(java.lang.String[] words)
Filter multiword units using major word class.

Parameters:
words - Major word class for each word in the ngram.
Returns:
true if multiword unit passes word class filters.

The verb filter removes all multiword units containing a verb.


generateResults

protected ResultsPanel generateResults(WordHoardSortedTableModel model,
                                       java.lang.String[] maxLabels,
                                       int sortColumn,
                                       int totalWordCount)
Displays results of multiword unit extraction in a sorted table.

Parameters:
model - Table model holding data to display.
maxLabels - Maximum width value for initial table columns.
sortColumn - Column on which to sort table.
totalWordCount - Total number of words.
Returns:
ResultsPanel with table and title.

isCloudAvailable

public boolean isCloudAvailable()
Is cloud output available?

Specified by:
isCloudAvailable in interface AnalysisRunner
Overrides:
isCloudAvailable in class FrequencyAnalysisRunnerBase
Returns:
true if cloud output available, false otherwise.

areResultOptionsAvailable

public boolean areResultOptionsAvailable()
Are result options available?

Specified by:
areResultOptionsAvailable in interface AnalysisRunner
Overrides:
areResultOptionsAvailable in class FrequencyAnalysisRunnerBase
Returns:
true if result options are available, false otherwise.

getResultOptions

public LabeledColumn getResultOptions()
Return result options.

Specified by:
getResultOptions in interface AnalysisRunner
Overrides:
getResultOptions in class FrequencyAnalysisRunnerBase
Returns:
Result options in a LabeledColumn.

getCloud

public ResultsPanel getCloud()
Show tag cloud of Dunning's log-likelihood profile.

Specified by:
getCloud in interface AnalysisRunner
Overrides:
getCloud in class FrequencyAnalysisRunnerBase
Returns:
ResultsPanel containing the cloud.