What is WordHoard?


The WordHoard project is named after an Old English phrase for the verbal treasure 'unlocked' by a wise speaker. It applies to highly canonical literary texts the insights and techniques of corpus linguistics, that is to say, the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In the WordHoard environment, such texts are annotated or tagged by morphological, lexical, prosodic, and narratological criteria. They are mediated through a 'digital page' or user interface that lets scholarly but non-technical users explore the greatly increased query potential of textual data kept in such a form.

It is a basic assumption of WordHoard that new kinds of historical, literary, or broadly cultural analysis will be supported through the forms of data access that are made possible when literary texts are treated in the manner of linguistic corpora. Deeply tagged corpora of course support more finely grained inquiries at a verbal or stylistic level. But more importantly, access to the words of a text at such microscopic levels also lets you look in new ways at the imaginative worlds created by those words.

In its current release WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer, Shakespeare, and Spenser. The section on Provenance, Copyrights, and Licenses provides detailed information about the texts.


Going from the Word Here to the Words There

WordHoard is a philological tool. ‘Philological’ is an old-fashioned but useful term to define inquiries into the uses of words. A ‘philologist’ is literally a ‘lover of words’, but whether philologists are motivated by love, duty, or a combination of both, they characteristically pay attention to the ways in which words work. In the world of language, Usage is King. Words are coins whose value is confirmed or subtly changed every time they change hands. If I am puzzled about how a word is used here it helps to look at how it was used there—especially if ‘there’ is not distant in space or time from ‘here’. Going from the word here to the words there is a core philological activity.

Devices for making that activity easier have a long history, including famously the Biblical concordance invented by medieval monks and the Lazy Susan devices Humanist scholars used to manage more than one book at the same time. But not even Thomas Jefferson’s model of this device would help you look simultaneously at different pages of the same book.

Much design work in WordHoard has gone into creating an interface that makes it easy and pleasing to go from the word here to the words there. The interface consists of elegant alternatives to the medieval monk’s concordance and to Jefferson’s Lazy Susan. How do I find the ‘words there’ and how do I move effortlessly from here to there?

To begin with the latter, the interface lets the reader display arbitrarily chosen pages side by side and keep them virtually in the same field of vision.

How do I find the ‘words there’? That is the concordance problem, solved with varying degrees of speed and complexity since the medieval Biblical concordance. Computers are good at generating keyword-in-context displays (KWIC) and going from any line in a KWIC list to a wider context. Mark Olsen’s Philologic search engine does this with lightning speed for vast text archives such as the Text Creation Partnership collection of early modern English texts (TCP).
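A keyword-in-context display is simple enough to sketch in a few lines. The following is a minimal illustration only, not WordHoard's or Philologic's implementation; the function name `kwic` and its `width` parameter are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Matching is whole-word and case-insensitive; `width` is the number of
    characters of context kept on each side of the keyword.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

sample = ("That fro the tyme that he first bigan "
          "To riden out, he loved chivalrie,")

for line in kwic(sample, "he"):
    print(line)
```

Note that a search of this kind matches spellings only. It knows nothing about word forms or lemmata, which is where the metadata discussed next comes in.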

WordHoard is not particularly fast at the task of delivering the raw output of a concordance search. Its power derives from its metadata, the meticulous cataloguing of each word occurrence by different criteria that can subsequently be used as search parameters.


Looking for Love

Let us, in the manner of Latin teachers, use 'love' as an example and look for 'love' in Chaucer, Spenser, and Shakespeare. For this exercise keep in mind the important distinctions between a 'spelling', a 'word form', and a 'lemma'. The word 'love' can take different forms, such as 'loves', 'loveth', 'loving', 'loved'. A dictionary will give you the most common form of a word, the base form of a verb or the singular of a noun. The technical term for this dictionary entry form, which bundles all the other forms, is 'lemma', and 'lemmatization' refers to the process of bundling different forms of a word under its lemma. A 'word form' refers to a state of a lemma that has some conditions imposed on it. Thus in 'loved' the -d suffix specifies pastness. The different forms of a word are known as its 'morphology'. A 'spelling' is just what it says it is: a sequence of letters that stand for a particular word form (and sometimes for more than one). 'Loved', 'loued', 'louyd' are different ways of spelling what could be one of two word forms of the lemma 'love': the past tense or the past participle.

For any word occurrence in a WordHoard text, the program knows what word form and lemma the spelling in that location refers to, and it also knows what other spellings of that word form or what other word forms of that lemma exist in the corpus. The practical implication of this fact is that a search for a lemma will retrieve all spellings of all word forms.
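The relationship between spelling, word form, and lemma can be pictured as a small data model. The sketch below is an assumption-laden illustration, not WordHoard's actual schema: the class `Occurrence`, the helper functions, and the `location` labels are hypothetical, while the spellings and part-of-speech tags are taken from Chaucer's forms of 'love'.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Occurrence:
    spelling: str   # the letters as they appear in the text
    pos: str        # tag identifying the word form, e.g. "vvd"
    lemma: str      # dictionary headword that bundles all forms
    location: str   # hypothetical placeholder for a position in the text

# A handful of illustrative occurrences; the locations are invented labels.
OCCURRENCES = [
    Occurrence("loved", "vvd", "love", "sample-1"),
    Occurrence("lovede", "vvd", "love", "sample-2"),
    Occurrence("loved", "vvn", "love", "sample-3"),
    Occurrence("yloved", "vvn", "love", "sample-4"),
    Occurrence("loveth", "vvz", "love", "sample-5"),
]

def build_lemma_index(occurrences):
    """Map each lemma to every occurrence tagged with it."""
    index = defaultdict(list)
    for occ in occurrences:
        index[occ.lemma].append(occ)
    return index

def search_lemma(index, lemma):
    """A lemma search returns all spellings of all word forms."""
    return index.get(lemma, [])

def spellings_by_form(index, lemma):
    """Group the attested spellings of a lemma under their word-form tags."""
    forms = defaultdict(set)
    for occ in search_lemma(index, lemma):
        forms[occ.pos].add(occ.spelling)
    return dict(forms)

index = build_lemma_index(OCCURRENCES)
print(sorted({occ.spelling for occ in search_lemma(index, "love")}))
print(spellings_by_form(index, "love"))
```

The point of the design is that the spelling 'loved' appears under two different tags, so a search by word form can separate what a search by spelling would lump together.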

In the Canterbury Tales the verb 'love' occurs first in the description of the knight as a man

That fro the tyme that he first bigan

To riden out, he loved chivalrie,

If you click on 'loved' and then follow the information about word forms you see at once what the grammarians used to call the 'paradigm' of the verb:

Part of Speech    Spelling
vvi (122)         love (104)
                  loven (18)
vvb (101)         love (101)
vvd (86)          loved (65)
                  lovede (20)
                  love (1)
vvz (72)          loveth (72)
vvn (26)          loved (24)
                  iloved (1)
                  yloved (1)
n-vvg (25)        lovyng (13)
                  lovynge (12)
vvp (24)          loven (24)
vv2 (5)           lovest (5)
vv2-imp (4)       loveth (4)
j-vvg (2)         lovynge (2)
vvd2 (2)          lovedest (2)
vvdp (2)          loveden (2)

There is, however, one big and useful difference between this table and a paradigm in a traditional grammar. Instead of seeing a systematic representation of legal forms, you see the frequency distribution of actually occurring forms, which in this particular case lets you see almost at once that Chaucer's morphology is predominantly 'modern' rather than 'medieval'. The modern forms 'love' and 'loved' are much more common than the Middle English forms 'loven', 'lovede', 'loveden' and 'yloved'. You come to the same conclusion by looking at the distributions of more common verbs with greater morphological and orthographic variance, such as 'say', 'have', or 'be'.
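The comparison of modern and Middle English forms can be checked by adding up the counts in the table. The sketch below copies those counts and totals them by spelling; the grouping into `MODERN` and `MIDDLE_ENGLISH` follows the forms named in the paragraph above, and the variable names are assumptions made for this example.

```python
from collections import Counter

# (part-of-speech tag, spelling, count) rows copied from the table above.
ROWS = [
    ("vvi", "love", 104), ("vvi", "loven", 18),
    ("vvb", "love", 101),
    ("vvd", "loved", 65), ("vvd", "lovede", 20), ("vvd", "love", 1),
    ("vvz", "loveth", 72),
    ("vvn", "loved", 24), ("vvn", "iloved", 1), ("vvn", "yloved", 1),
    ("n-vvg", "lovyng", 13), ("n-vvg", "lovynge", 12),
    ("vvp", "loven", 24),
    ("vv2", "lovest", 5),
    ("vv2-imp", "loveth", 4),
    ("j-vvg", "lovynge", 2),
    ("vvd2", "lovedest", 2),
    ("vvdp", "loveden", 2),
]

# Total occurrences per spelling, regardless of word form.
by_spelling = Counter()
for _tag, spelling, count in ROWS:
    by_spelling[spelling] += count

# The two groups of forms named in the text.
MODERN = ["love", "loved"]
MIDDLE_ENGLISH = ["loven", "lovede", "loveden", "yloved"]

modern_total = sum(by_spelling[s] for s in MODERN)
middle_total = sum(by_spelling[s] for s in MIDDLE_ENGLISH)

print(f"modern forms:         {modern_total}")
print(f"Middle English forms: {middle_total}")
print(f"all forms:            {sum(by_spelling.values())}")
```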

Sometimes the different spellings of a word will throw a light on its origin. Take the opening line of the Canterbury Tales: "Whan that Aprill with his shoures soote." The meaning of the last word is not obvious. It is glossed as 'sweet'. If you see that the more common spellings of that word are 'swote' (18) and 'swoote' (8) you recognize instantly that it is in fact a dialectal variant of 'sweet'.

Spenser used a lot of spellings that were idiosyncratic and deliberately archaic by the standards of his own day. For that reason, modern editions of his poems are always printed with their original spellings. WordHoard lets you capture his orthographic variance. For 'love' this turns out to be not especially interesting. You see quickly that he usually spells 'love' 'loue'. But so do all his contemporaries. 'Bloody' is more interesting: of the 112 occurrences of that adjective only two are spelled the modern way, and the preferred spellings are 'bloudy' (56), 'bloudie' (43), 'bloodie' (6), and 'blouddy' (4). This simple example also shows that you would need fairly complex 'wildcard' or 'regular expression' searches to capture orthographic variance with precision.
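To give a sense of what such a search involves, here is one regular expression that covers the five spellings of 'bloody' cited above. It is a sketch written for this example, not a pattern used by WordHoard, and it illustrates the precision problem: it would also accept spellings that are not attested, such as 'blooddie'.

```python
import re

# 'blo', then 'o' or 'u', then one or two 'd's, then 'y' or 'ie'.
BLOODY = re.compile(r"blo[ou]dd?(?:y|ie)", re.IGNORECASE)

attested = ["bloody", "bloudy", "bloudie", "bloodie", "blouddy"]
related_words = ["blood", "bloud", "bloodless"]

for word in attested + related_words:
    verdict = "match" if BLOODY.fullmatch(word) else "no match"
    print(f"{word:<10} {verdict}")
```

A lemma search avoids the problem altogether, because each occurrence has already been assigned to its lemma regardless of how it is spelled.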

With the WordHoard Shakespeare text, the utility of displaying orthographic and morphological variance diminishes somewhat. It is a modern spelling edition to begin with, and except for the second person singular, Shakespeare's morphology is no richer than that of modern English. Lemmatization remains useful in distinguishing the different word forms or lemmata represented by such spellings as 'may', 'art', 'will', 'like' and so forth. It is easy and useful to look for the 92 occurrences of the noun 'art' without having to worry about the 906 occurrences of the verb form 'art'.

But the metadata in WordHoard let you ask additional questions about word usage. Do men and women differ in the frequency with which they talk about 'love'? Is 'love' more often the subject of verse than prose?

If you know your way around WordHoard, fifteen minutes' work will let you construct a table like the following:

Categories    All words    love: noun (per 10K)    love: verb (per 10K)
Men           663,000      811 (12.2)              1033 (15.6)
Women         141,000      251 (17.8)              372 (26.4)
Verse         653,000      872 (13.3)              1434 (21.9)
Prose         203,000      263 (13.0)              245 (12.1)

This corroborates in more precise terms what you might have expected. Men speak a lot more than women (the ratio is 4.7 to 1) and verse is more common than prose by a ratio of 3.2 to 1. A little eighth grade math shows you that, relatively speaking, Shakespeare's women are about 50% more likely than men to use the noun 'love' (17.8 against 12.2 per 10,000 words) and that, on the figures in the table, the verb 'love' is almost twice as common in verse as in prose (21.9 against 12.1), while the noun is about equally common in both.
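The eighth grade math can be spelled out as follows. This is a minimal sketch using the figures from the table; the names `TABLE`, `per_10k`, and `rate_ratio` are assumptions made for this example. Because the word totals in the table are rounded to the nearest thousand, a recomputed rate may differ from the printed one in the last digit.

```python
# Figures copied from the table: (all words, noun count, verb count).
TABLE = {
    "Men":   (663_000, 811, 1033),
    "Women": (141_000, 251, 372),
    "Verse": (653_000, 872, 1434),
    "Prose": (203_000, 263, 245),
}
NOUN, VERB = 1, 2  # column positions within each tuple

def per_10k(count, total):
    """Relative frequency per 10,000 words."""
    return 10_000 * count / total

def rate_ratio(group_a, group_b, column):
    """How many times more frequent the word is in group_a than in group_b."""
    rate_a = per_10k(TABLE[group_a][column], TABLE[group_a][0])
    rate_b = per_10k(TABLE[group_b][column], TABLE[group_b][0])
    return rate_a / rate_b

for label, (total, noun, verb) in TABLE.items():
    print(f"{label:<6} noun {per_10k(noun, total):5.1f}   verb {per_10k(verb, total):5.1f}")

print(f"all words, Men : Women    {TABLE['Men'][0] / TABLE['Women'][0]:.1f}")
print(f"all words, Verse : Prose  {TABLE['Verse'][0] / TABLE['Prose'][0]:.1f}")
print(f"noun, Women vs Men        {rate_ratio('Women', 'Men', NOUN):.2f}")
print(f"verb, Women vs Men        {rate_ratio('Women', 'Men', VERB):.2f}")
print(f"noun, Verse vs Prose      {rate_ratio('Verse', 'Prose', NOUN):.2f}")
print(f"verb, Verse vs Prose      {rate_ratio('Verse', 'Prose', VERB):.2f}")
```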

But this may be too coarse an analysis. Women speak relatively more in comedy (the male:female ratio is about 3:1), and love is a big topic in comedy. Do women in comedies use the word 'love' more often than men? You can find that out easily.


About the Distribution and Frequency of Words

The British linguist J. R. Firth famously remarked that you shall know a word by the company it keeps. Implicit in the concept of company is the concept of frequency. What places does a word 'frequent', to use the word in an old-fashioned sense, and in the company of what other words? In everyday discourse we are exquisitely sensitive to the relative commonness or rareness of words, although we rarely bother, and indeed are typically unable or reluctant, to express our perceptions in explicitly quantitative terms. But frequency is a very powerful property of a word, and it is an excellent measure of its currency and usage, which are the ultimate determinants of its meaning.

Computers are not exquisitely sensitive, but they are very good at keeping accurate track of where and how often words are used, and they can do this very fast. What they keep track of is typically a lot cruder than what the attentive and competent human listener or reader picks up along the way. But crude as the information is, it can be grouped or sorted very quickly, and the results of such operations often provide useful evidence for a number of inquiries.

A quite striking demonstration of the power of very crude figures is given by the following table, which simply lists the ten most common nouns in Homer, Chaucer, Spenser, and Shakespeare in descending order of occurrence:

Homer        Chaucer    Spenser    Shakespeare
man          man        knight     lord
ship         thing      man        man
god          god        hand       sir
spirit       heart      lady       love
hand         love       day        king
son          day        way        heart
horse        folk       life       eye
father       time       heart      time
word         word       place      hand
companion    manner     sight      father

Separately and in combination these simple lists can be made to tell quite powerful stories about what are or are not dominant topics in the various authors. It is difficult to think of a better preliminary guide to the study of Homer than to keep in mind the three top nouns: man, ship, and god. The word I translated as 'spirit' is the Greek 'thymos', which stands for some inner organ or energy that is very difficult to define or describe. A comparison of the lists suggests that it roughly takes the place that in the English authors is occupied by 'heart'.

In WordHoard we have taken great care to put frequency information within easy reach of users and in such a manner that it can be used without requiring more than basic numeracy. If you look up a word, you are told how often it occurs in the work at hand and in the larger corpus of which it is a part. The raw counts are also expressed as relative frequencies per 10,000 words, which makes it easier to compare counts. Word rank can be another useful measure. To return to 'love', only three nouns occur more often in Shakespeare than 'love', and one can make something of the fact that they are the cluster 'lord', 'man', and 'sir'. A final measure is the distribution of a word across different works. A technical term for this is 'document frequency'. If a word occurs twenty times in a corpus, it makes a considerable difference whether it is found twenty times in one work or once in twenty works.
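The measures described here (raw count, relative frequency per 10,000 words, rank, and document frequency) are easy to compute once a text has been reduced to a list of words. The sketch below uses a tiny invented corpus purely for illustration; the work names, the word lists, and the function names are all hypothetical and say nothing about the real figures in WordHoard.

```python
from collections import Counter

# Hypothetical toy corpus: each "work" is just a short list of nouns.
WORKS = {
    "work_a": "love love love lord man".split(),
    "work_b": "lord man sir lord lord".split(),
    "work_c": "man love sir lord".split(),
}

def corpus_counts(works):
    """Raw count of every word across all works."""
    counts = Counter()
    for tokens in works.values():
        counts.update(tokens)
    return counts

def per_10k(count, total):
    """Relative frequency per 10,000 words."""
    return 10_000 * count / total

def rank(word, counts):
    """1-based rank by descending count; tied words share the better rank."""
    return 1 + sum(1 for c in counts.values() if c > counts[word])

def document_frequency(word, works):
    """Number of works in which the word occurs at least once."""
    return sum(1 for tokens in works.values() if word in tokens)

counts = corpus_counts(WORKS)
total_words = sum(counts.values())

for word in sorted(counts, key=counts.get, reverse=True):
    print(f"{word:<5} count {counts[word]}  "
          f"per 10K {per_10k(counts[word], total_words):7.1f}  "
          f"rank {rank(word, counts)}  "
          f"works {document_frequency(word, WORKS)}")
```

In this toy corpus 'love' is more frequent than 'man' overall but occurs in fewer works, which is exactly the kind of difference that document frequency is meant to expose.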

