Metadata and the Query Potential of the Digital Surrogate

About Metadata and the Query Potential of the Digital Surrogate

WordHoard presents you with surrogates of texts that originated in another medium. The Iliad and Odyssey may have circulated as oral poems before being written down on manuscript scrolls. Two thousand years later these manuscripts were printed as books. Now these books circulate in digital form. I call this the 'allographic' journey of texts. The term comes from the philosopher Nelson Goodman, who distinguishes between autographic and allographic works. A painting or sculpture is autographic. Not only is it tied to a particular embodiment, but it exists most fully in a singular instance that is categorically distinct from any copy. By contrast, a poem or musical composition is not essentially tied to a particular notation: there is an infinite number of arbitrary schemes to represent the text of Hamlet or the score of the Appassionata.

If works are allographic, it should not matter in principle how they are recorded or transcribed as long as the information remains the same. In practice, it is quite hard to separate form from content rigorously. If you are used to 'reading' a text in one way, a new format may disorient you. Or it may simply not be the case that you can change the 'how' of a representation without changing its 'what'. That is certainly the case when a poem moves from an oral into a literate sphere.

Hence my term 'surrogate'. One form of mediation substitutes for another. When you use words like substitute or surrogate you are aware of possible loss. The surrogate, in particular, carries with it the connotation that it is a second best. But looking at a facsimile of the Mona Lisa may tell you more about her famous smile than three pages of dense prose, even if you never have a chance to see the original in the Louvre. Second bests are a lot better than nothing at all.

Surrogates may also be better than the original for certain purposes. If I owned an original First Folio of Shakespeare, I probably would not use it to read the plays. Not only because I would be afraid of damaging it by turning its pages too often, but also because it is a pretty cumbersome text to handle. A modern edition works much better for that purpose, and if I want to stretch out on a sofa, I would not pick the heavy Riverside Shakespeare but choose a paperback that I can hold comfortably at the right distance.

If you reflect a little on this simple example you will discover that in the modern world virtually all of our encounters with cultural objects turn on surrogate experiences. You also see that the surrogate, far from being a second best, serves better for some purposes than the original. And if you are philosophically inclined you may wonder whether the so-called original is "always already" a surrogate of something else. But you need not travel that 'deconstructive' route to agree with the practical conclusion that even if you had unlimited access to the original you would often choose a surrogate.

The surrogate, then, offers first-order and second-order advantages. The first-order advantage consists in its "better than nothing" quality. The second-order advantages consist of all those features that make you use the surrogate even when you have access to the original. That brings us to the digital surrogate and the question of what second-order advantages it offers over the putative original that it replaces in some contexts. That is an important question to keep firmly in mind, for "advantage" is not something that gives itself to you. It is something to be taken, as in "take advantage of". But you cannot take it unless you know what it is and where it is found.

The first-order advantage of the digital surrogate of printed texts is very obvious and very powerful. Given a quite modest computer and a not very fast Internet connection, you have virtually free access to more books than you could possibly read in a lifetime, and if you are a moderately watchful user you can easily learn how to judge whether a given surrogate is good enough for the purpose.

The second-order advantage of the digital surrogate of a text is a little harder to grasp, and it requires reflection on the nature of a written page. Such a page is a sequence of encodings addressed to a human reader who brings a set of complex and largely tacit skills to the task of making sense of the frequently underdetermined signs on the page. These skills are tacit, but they are learned. Children learn without formal instruction how to understand and generate spoken utterances. But it takes a great deal of effort to teach a child how to encode such utterances in a graphic medium and how to decode somebody else's encodings. And if you have not learned it as a child it is even harder to learn it as a grown-up. A competent third-grader may be said to be literate for most practical purposes. Such third-graders will have spent a non-trivial amount of time every day for more than half their life mastering this set of skills. A very remarkable achievement it is.

The signs on the page are far from self-evident instructions but require active interpretation by the reader, who 'makes sense' of them, which typically involves resolving ambiguities, supplying contextual information, overlooking errors, and the like. The simplest digital version of such a page merely aims at replicating those signs in such a manner that a human reader can read them in the ordinary manner, whether on a screen or in printed form. But even this simple digital surrogate acquires some second-order advantages.

When a text is digitized it is turned into a nested list of characters, words, and lines. Look at the word count feature of Microsoft Word, where you are told instantly that your essay consists of so many characters, words, and lines. Moreover, the computer keeps track of the locations and counts of the items in those lists, so that even the most primitive digitized version of a text is an inventory of its words that can be counted and sorted by various criteria. The sequence of words in the text is complemented by a 'bag of words' model, which remembers chiefly how many different words there are and how often each occurs. You learn a surprising amount about a text by looking at its bag of words. Googling and other forms of information retrieval are based on reducing large collections of documents to word bags and comparing their contents. The procedure may be deeply offensive to writers who toil over the exact sequence of their words. But for many purposes it works better than it should. And the many ways in which a computer keeps track of the words in a digital text also account for the fact that it is much easier to find something in a digital file than in a book.
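The 'bag of words' idea is easy to illustrate. The few lines of Python below are only a sketch of the general notion, not anything WordHoard itself runs:

    from collections import Counter
    import re

    text = "To be, or not to be, that is the question"

    # Crude tokenization; a real system handles spelling, hyphenation,
    # and apostrophes far more carefully.
    words = re.findall(r"[a-z']+", text.lower())

    # The bag of words: the sequence is forgotten, the counts remain.
    bag = Counter(words)
    print(bag.most_common(3))   # [('to', 2), ('be', 2), ('or', 1)]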

More powerful second-order advantages accrue if you no longer think of the digital file as something that is meant to be read by a human, but think of it instead as a data structure to be processed by algorithms of various kinds so that you can find out some things about the text that you could not discover by just reading it. For this to happen, however, you must make explicit in the data at least some of the readerly knowledge that humans implicitly bring to the task of making sense of words on the page. You do this by surrounding the 'data' with an appropriate set of 'metadata' that make tediously explicit in every instance what the human reader already knows.

If you do this — and much of it can be done automatically with reasonable degrees of accuracy — the digital surrogate supports searches that would be difficult or practically impossible to do in a print environment. If you tag every occurrence of an adjective with a symbol that says 'I am an adjective', you do nothing for the reader qua reader, but you make it a trivial task to extract a list of adjectives. If you believe that the distribution of adjectives tells you something about a writer's values and habits, that may be a useful thing to have. If you mark every speech of Ophelia as a speech by her and do the same for the other characters in Hamlet, it becomes a trivial task to compare adjectives used by Ophelia with those used by Hamlet. This can be done very quickly and will sometimes produce useful insights or provide helpful corroboration for conclusions intuitively reached. Evidence is a good thing.
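Here is a toy illustration in Python of what such tagging makes trivial. The markup and attribute names are invented for the example and are not WordHoard's actual encoding, but the principle is the same:

    import xml.etree.ElementTree as ET
    from collections import Counter

    # Invented markup: each word carries a part-of-speech tag, each speech a speaker.
    sample = """<scene>
      <sp speaker="Ophelia"><w pos="j">good</w> <w pos="n1">lord</w></sp>
      <sp speaker="Hamlet"><w pos="j">harsh</w> <w pos="n1">world</w></sp>
    </scene>"""

    adjectives = Counter()
    for sp in ET.fromstring(sample).iter("sp"):
        for w in sp.iter("w"):
            if w.get("pos") == "j":            # 'I am an adjective'
                adjectives[(sp.get("speaker"), w.text)] += 1

    print(adjectives)   # Counter({('Ophelia', 'good'): 1, ('Hamlet', 'harsh'): 1})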

Metadata exist at different levels. A catalogue entry for a book is a kind of metadata and provides explicit summary information about the document as a whole. Part-of-speech tagging provides metadata at the molecular level of word occurrence. You may want to think of it as a way of 'cataloguing' each word occurrence of a book. At an intermediate level, metadata may catalogue speeches, scenes, acts, or chapters in a play or novel, different types of rhyme, or other phenomena that human readers ordinarily observe and process tacitly. It is important to keep in mind that metadata from different levels of a document hierarchy can be chained together in various ways to support complex queries.

The metadata for the texts in WordHoard are unusually rich. The texts are highly canonical, they continue to be analyzed by many scholars from different perspectives, and it is therefore worthwhile cataloguing data about them in a detailed and systematic fashion. John Unsworth, the Dean of the Library School at the University of Illinois, has coined a witty acronym for the power of metadata. He calls it the MONK principle, which stands for Metadata Offer New Knowledge. When you analyze a text in an environment like WordHoard, it is useful to keep in mind that the queries you formulate typically are not run against the data, but against the metadata. And if you are curious about a given phenomenon, the answers will often come from thinking about the metadata and the ways in which different types of them can be combined in queries.


Levels of Metadata in WordHoard

Morphological tagging of Early Greek epic

The morphological tagging of Early Greek epic is based on the Morpheus module of Perseus. Morpheus assigns to each ancient Greek spelling its possible morphological descriptions. A description consists of some combination of tense, mood, voice, case, gender, person, and number. The Morpheus descriptions served as the basis for disambiguation routines that assign to each word occurrence in Early Greek epic the actual morphological function the word has in that place.
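The idea is easy to picture in a few lines of Python. The records below are a hypothetical sketch of the kind of data involved, not actual Morpheus output or WordHoard's internal format:

    # Hypothetical sketch: a single ancient Greek spelling with two possible
    # morphological descriptions. The values are illustrative only.
    descriptions = {
        "thea": [
            {"lemma": "thea", "pos": "noun", "case": "nominative",
             "gender": "feminine", "number": "singular"},
            {"lemma": "thea", "pos": "noun", "case": "vocative",
             "gender": "feminine", "number": "singular"},
        ]
    }

    # Disambiguation assigns to each occurrence the description that fits its
    # context; here the choice is simply recorded by index.
    occurrence = {"spelling": "thea", "location": "Iliad 1.1", "analysis": 1}
    chosen = descriptions["thea"][occurrence["analysis"]]
    print(chosen["case"])   # 'vocative'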

NUPOS: A hybrid scheme to accommodate Chaucer, Spenser, and Shakespeare

WordHoard employs a tagging scheme named "NUPOS" that tries to capture major morphosyntactic features from Chaucer to Modern English. For a detailed description of NUPOS see the PDF document:

NUPOS: A part of speech tag set for written English from Chaucer to the present.

The NUPOS scheme is a hybrid. One of its sources is Larry Benson's scheme for tagging Middle English. The other is the CLAWS tagger developed at Lancaster University and used for the tagging of the British National Corpus.

There are advantages and disadvantages to using a common scheme for Chaucer, Spenser, and Shakespeare. To begin with the disadvantages, you lose some precision in the treatment of Middle English verb forms. On the other hand, there are clear advantages to a scheme that lets you compare Chaucer with Spenser or Shakespeare. Some of the disadvantages are minimized by our decision to present morphological data for Chaucer in both Larry Benson's scheme and the hybrid scheme. Benson's morphological data appear as glosses in the text, but you cannot use them as the basis for searches.

The hybrid scheme has close to 280 tags. More than a third of them are used very rarely, and the great majority of word occurrences are captured by some three dozen tags. There may be virtue in a coarser scheme that employs between 80 and 100 tags.

How to read the hybrid tags

The tag for any word in Chaucer and Shakespeare provides information about between one and three things:

  • the syntactic function of the word in its current context
  • inflectional state (if any)
  • the word class to which the word belongs in all or most of its uses

The Windows menu of WordHoard takes you to a complete list of abbreviations and explanations. Here is a brief account of how the tags work.

The first part of the tag always tells you how the word is used in the current context. If you see the tag 'j', you learn that the word is an adjective and is used as an adjective here. If you see the tag 'av-j', you learn that an adjective is used adverbially, as in 'beautifully' or 'fast' as in 'run fast'.

Information about the inflectional state of a word comes second. Thus 'n1' marks the singular of a noun, and 'n2' marks the plural. Tags like 'jc' or 'js' mark comparative and superlative forms of adjectives.

You will notice that some tags have hyphens and others do not. The hyphen typically points to the fact that a word is used in one of several possible ways. For instance, the tag 'vvg' identifies the '-ing' form or present participle of a verb. If you see the tag 'vvg' by itself you know that an '-ing' form is used as a participle. But if you see 'j-vvg' you learn that a participial form is used as an adjective, as in 'my loving lord'.

Hyphenated forms also serve the function of specifying the range of error. Computer-generated part-of-speech tagging is a pretty error-prone business: on average three out of a hundred words are classified wrongly, and a standard printed page of 400 words will have a dozen errors. A book with a dozen spelling errors per page would not strike you as a carefully printed book. But correcting errors by hand is prohibitively time consuming. If you use part-of-speech tagging you have to be aware that it is a pretty rough business, but better than nothing. One way of containing error is to mark the range within which it will occur. Words like 'since' or 'as' can function as conjunctions, adverbs, or prepositions, and it is not always easy to tell these different uses apart. When you see a tag like 'p-acp', you know that the word belongs to a class of words that hover between adverbial, conjunctive, and prepositional uses. That is the point of the information after the hyphen. The letter before the hyphen represents the tagger's best guess about how the word is used here.
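Read as data, these tags are easy to take apart: the part before the hyphen gives the contextual use, the part after it the word class. A minimal sketch in Python, assuming the tags are available as plain strings; this is a simplified reading of the structure described above, not a full NUPOS parser:

    def read_tag(tag):
        """Split a tag into (contextual use, word class)."""
        if "-" in tag:
            use, word_class = tag.split("-", 1)
        else:
            use = word_class = tag
        return use, word_class

    print(read_tag("j"))      # ('j', 'j')     adjective used as an adjective
    print(read_tag("av-j"))   # ('av', 'j')    adjective used adverbially
    print(read_tag("j-vvg"))  # ('j', 'vvg')   '-ing' form used as an adjective
    print(read_tag("p-acp"))  # ('p', 'acp')   word from the adverb/conjunction/preposition group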

There are several types of words that hover between different syntactic uses. There are words that can be used equally as adjectives or nouns. Color words are a good example. Such words have been assigned to the word class 'jn'. Another group of words hover between adverb and noun: 'home' and 'today' are good examples, and they are classified as 'an'. Some names work equally well as nouns or adjectives (Christian, Florentine, Mahometan), and they have been classified as 'jp'.

For more detailed information, look at the section on Parts of Speech and Word Classes in this manual or consult the Parts of Speech table, which you can find in the Windows menu of WordHoard.

Lemmatization

The commitment to a common tagging scheme for Chaucer and Shakespeare implies a commitment to common lemmatization. Benson's tagging scheme and lemmatization are preserved in the glosses that a reader sees for each word in the footer of the page or in the Get Info window. Benson's lemmatization is based on Middle English practice and dictionaries. Thus you see 'percen' as the lemma for the form 'perced'.

The lemmatization for the hybrid morphological scheme tries, wherever possible, to establish common lemmata for Chaucer and Shakespeare. Benson himself points the way for this practice by linking a majority of his lemmata to dictionary entries in the OED.

Lemmatization is a rough form of categorization with fuzzy boundaries. For a native German speaker, the last word of the first line of the Canterbury Tales (shoures soote) is obviously a dialectal variant of 'sweet' and a form of the adjective that he knows as 'süss'. And a case could be made for bundling 'soote' and 'sweet' under the same lemma. But for good reasons the OED does not do so, and I have generally tried not to be smarter than the OED on such matters.

On the other hand, if you lemmatize, you will have to make some choices, and I have a strong preference for erring on the side of bundling rather than splitting. Take the word 'helôria' from the fourth line of the Iliad. In Liddell-Scott-Jones, the OED for ancient Greek, this is referred to the lemma 'helôrion,' and we learn that 'helôrion = helôr.' The lemma 'helôr' is given full treatment, including a translation as 'spoil, prey.' The combination of editorial and typographic practice leaves it conveniently open whether 'helôrion' and 'helôr' are independent lemmata.

Are there one or two lemmata here? Problems of this kind occur in about 2 percent of dictionary entries and something like 0.025 percent of word occurrences. My hunch is that for most users, lumping will produce more informative results. If, for example, 'helôria' is seen as a form of the lemma 'helôrion,' a user clicking on its sole occurrence in Iliad 1.4 would be informed that it is a hapax legomenon. By classifying it as a form of 'helôr,' the user who clicks on the lemma sees immediately that the ten Homeric occurrences are divided as follows: 'helôr' (8), 'helôra' (1), and 'helôria' (1). Given the distribution of these particular forms it seems much more plausible to argue for a single lemma than to assume that there is a third-declension lemma 'helôr' and a first-declension lemma 'helôrion.' But regardless of the judgment on a particular case, the policy of lumping problematic cases has the advantage of producing errors of the 'false positive' kind, which are easier to spot than the 'false negatives' that would follow from a policy of splitting.

By and large, then, the lemmatization followed in WordHoard has a bundling tendency and proceeds from the assumption that errors of bundling are more transparent to the user than errors of splitting.

Errors and ways of reporting them

If you spot an error you will do other users a favor by reporting it. We have made it very simple to do so. Select the word that you believe to be wrongly tagged and choose Send Error Report from the File Menu. This will trigger an email with the precise location of the word. It is not necessary to explain the error. The fact that a user has reported a mistake about this particular word occurrence will be enough to trigger analysis.

You can also communicate with the WordHoard team by sending email to Professor Martin Mueller.

The morphological data in Early Greek epic have been available through the Chicago Homer for several years. Users have made many corrections, and several users have combed through particular books line by line. Residual errors are few, and quantitative analyses of morphological data are extremely unlikely to be affected by them.

Larry Benson's morphological data are very accurate. Errors in the hybrid tagging data for Chaucer are in nearly all cases the results of mistakes in translating Benson's data into the new scheme.

The tagging of Spenser and Shakespeare has higher residual error rates.

The Shakespeare texts and some of the Spenser texts were originally tagged with the CLAWS tagger, and the resultant data were transformed into the hybrid scheme through a series of semi-automatic routines. Some distinctions not made by CLAWS, such as the use of 'that' as a conjunction or relative pronoun, were made manually by undergraduate teams. Despite several rounds of checking, errors remain. Many of the residual errors have to do with grammatical words and the categorization of participial forms as verbal or adjectival.

Some of the Spenser texts were tagged with the Northwestern University MorphAdorner program.

The more errors you spot and report the fewer there will be.

Narratological metadata

When you read a play, stage directions and speaker identifications are tacitly processed by you as metadata. In a properly digitized text of a play, there are explicit start and end tags that distinguish all the words spoken by Ophelia from the words of Hamlet or any other speaker. If you classify a speaker by gender and mortality, as we have done, you have a handle on all the words spoken by goddesses in the Iliad or by ghosts in Shakespeare.
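A toy sketch of how such classifications can be put to work. The dictionaries and attribute names below are invented for illustration; WordHoard's own speaker records are richer and stored differently:

    # Invented speaker records: gender and mortality as simple attributes.
    speakers = {
        "Athena": {"gender": "female", "mortal": False},
        "Hector": {"gender": "male",   "mortal": True},
    }

    # Invented speeches: each one points back to its speaker.
    speeches = [
        {"speaker": "Athena", "words": ["word1", "word2"]},
        {"speaker": "Hector", "words": ["word3"]},
    ]

    # All words spoken by immortal female speakers, i.e. by goddesses.
    goddess_words = [
        word
        for speech in speeches
        if speakers[speech["speaker"]]["gender"] == "female"
        and not speakers[speech["speaker"]]["mortal"]
        for word in speech["words"]
    ]
    print(goddess_words)   # ['word1', 'word2']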

The WordHoard texts have quite rich data of this kind for Early Greek epic and Shakespeare. The Chaucer and Spenser data are simpler. There are practical reasons for this. Printed editions of Shakespeare distinguish between speakers, as is common in printed drama. In Homer a speech never begins in the middle of a line, and every speech is formally introduced and terminated. Thus it is a quite straightforward matter to distinguish between narrated and spoken lines or lines spoken by X and Y.

It is much harder in Chaucer, as witnessed in the famous line about Alisoun in The Miller's Tale:

"Tehee!" quod she, and clapte the window to.

That is one reason why in narrative poetry or in fiction spoken dialogue is hardly ever identified as such in digital transcriptions, even though it would be quite desirable to do so. It is a very labour-intensive job. How would you train a computer to recognize from various clues that the speaker of 'Tehee' is Alisoun — a fact that is blindingly obvious to minimally competent readers?

Prosodic metadata

Prosodic metadata tell you about the metrical status of a given utterance. Is it metrically bound (poetry) or not (prose)? If it is poetry, is it in the default metre — e.g. blank verse in drama — or does it have some special features of rhyme or stanza structure?

This is a simple matter in a corpus where everything follows the same pattern. Every line in The Faerie Queene is part of the same type of Spenserian stanza. Every line in Early Greek epic is a hexameter. But the Chaucer corpus is very mixed. And in Shakespeare you find not only the distinction between verse and prose, but between different kinds of verse. Many of the prosodic features of Chaucer and Shakespeare are captured in the metadata and can be used as the basis for searches.

Metrical parsing of Early Greek epic

All the hexameters in Early Greek epic have been parsed. WordHoard 'knows' the metrical shape of each line and of every word in each line. You can make metrical word shape a search criterion.

The metrical scheme for Homeric hexameters is tediously verbose, but has the advantage of being readily processed by a computer. Each hexameter divides into six 'metra' or feet, and each metron divides into two halves, of which the first is always long and the second may be either long or short. Thus the position of a syllable in the metrical pattern can be exactly specified with a three-digit code. The first digit has values between 1 and 6 and defines the metron. The second digit has the values 1 or 2 and defines the first or second half of the metron. The third digit has the values 1, 2, or 0 and defines the quality of the second half. It is 0 if the second half consists of one long syllable. If it consists of two short syllables, they are identified as 1 or 2. The opening line of the Iliad scans as follows:

mê-nin a-ei-de the-a Pê-lê-ï-a-deô A-chi-lê-os
110-121 122-210-221 222-310 320-410-421-422-510 521-522-610-620

And so on for 32,468 lines of Early Greek epic.
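Because the code is so regular, a program can unpack it with trivial string handling. A minimal sketch in Python (the function name is mine, not part of WordHoard):

    def decode_position(code):
        """Decode a three-digit metrical position code into its parts.

        Returns (metron, half, quality): the metron (1-6), the half of the
        metron (1 or 2), and the quality digit (0 for a single long syllable,
        1 or 2 for the first or second of two shorts).
        """
        return int(code[0]), int(code[1]), int(code[2])

    # The first word of Iliad 1.1, 'mê-nin', occupies positions 110 and 121:
    print(decode_position("110"))   # (1, 1, 0)  first metron, first half, one long
    print(decode_position("121"))   # (1, 2, 1)  first metron, second half, first of two shorts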

