< 

Parts of Speech XML File

< 

Table of Contents

< 

Standard Spelling XML Files


Work XML Files

Work XML files define the structure, text, tagging data, and visual rendering of the individual works. WordHoard uses a subset of the Text Encoding Initiative (TEI) standard, together with WordHoard-specific extensions.

The file WordHoardText.xsd in the misc directory is a schema which can be used to validate work files.

The best way to learn the details of WordHoard's XML format for works is to study examples along with the formal specification given below. Download the following sample files. Open them with your favorite text editor or XML editor. Position the windows next to your browser window. As you read the descriptions of the individual elements below, look at how each element is used in the examples.

ham.xml: Shakespeare's Hamlet.
faq.xml: Spenser's The Faerie Queene.
IL.xml: Homer's The Iliad.

Hamlet illustrates most of the elements and attributes. The Faerie Queene illustrates stanza numbering, Spenser indentation, and the hi element. The Iliad illustrates unrendered cast of characters pages, unrendered speeches, original-language speaker names, and Greek text.

WordHoard supports styled text. The rend="style" attribute is used to set styles. The style value may be any of the following:

  • bold. Boldface.
  • italic. Italics.
  • extended. Extended style, with extra space between characters.
  • sperrtext. Same as extended.
  • underline. Underline.
  • overline. Overline.
  • macron. Same as overline.
  • superscript. Superscript.
  • subscript. Subscript.
  • monospaced. Monospaced font.
  • normal. Normal text with none of the above styles.
  • roman. Same as normal.
  • plain. Same as normal.

In the main body of text for our current corpora, we use only the normal, bold and italic styles. The Iliad Scholia also use the sperrtext, macron, superscript, and monospaced styles.

Text lines can also be left-justified, centered, or right-justified using the align="alignment" attribute. The alignment value may be any of the following:

  • left. Left-justified.
  • center. Centered.
  • right. Right-justified.

Text lines can be indented using the indent="nnn" attribute. The indentation "nnn" is measured in pixels. If align="center" or align="right" is specified, however, the indent attribute is ignored.
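
A minimal sketch of how the rend, align, and indent attributes might appear on untagged p elements (described below). The text content is placeholder wording, not taken from any WordHoard work, and the lines are shown as a fragment rather than a complete file:

```xml
<!-- Style: the whole line is rendered in italics. -->
<p rend="italic">An italic line.</p>

<!-- Alignment: the line is centered. -->
<p align="center">A centered line.</p>

<!-- Indentation: the line is indented 40 pixels. -->
<p indent="40">An indented line.</p>

<!-- indent is ignored here because align="right" is specified. -->
<p align="right" indent="40">A right-justified line.</p>
```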

A WordHoard XML input file for a work has the following elements:

  • WordHoardText. The root element.

    Children:

    • wordHoardHeader (1). WordHoard header.
    • teiHeader (1). TEI header.
    • text (1). The work parts and their text.
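
    As a rough sketch, the overall skeleton of a work file might look like the following. The comments stand in for content described under the individual elements below. The corpus and work ids are borrowed from the Hamlet examples on this page, any namespace or schema attributes on the root element are omitted because they are not described here, and the child order shown is an assumption; WordHoardText.xsd is the authority.

    ```xml
    <WordHoardText>
        <wordHoardHeader corpus="sha" work="ham">
            <!-- pubDate (optional) and taggingData -->
        </wordHoardHeader>
        <teiHeader>
            <!-- fileDesc -->
        </teiHeader>
        <text>
            <front>
                <!-- front matter div elements -->
            </front>
            <body>
                <!-- body div elements -->
            </body>
        </text>
    </WordHoardText>
    ```
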
  • WordHoardText/wordHoardHeader. WordHoard header for the work.

    Attributes:

    • corpus = the corpus id. Required. This id must match a corpus id defined in the corpus XML file. Multiple ids may be listed separated by vertical bars, in which case the first one matched is used. E.g., "sha|emd".
    • work = the work id. Required.
    • prosodic="prose" or "verse". Optional. Establishes a default prosodic attribute for the tagged words in the work.

    Children:

    • pubDate (0..1). The publication date. Omit this child if the publication date is unknown (e.g., in the Early Greek Epic corpus).
    • taggingData (1). The tagging data categories supported by the work.
  • pubDate. Publication date.

    Children:

    • TEXT (1). The publication date, in the form "year" for a single year or "year-year" for a range of years.
  • taggingData. WordHoard tagging data categories.

    Children:

    • lemma (0..1). Present if the work supports lemma tagging.
    • pos (0..1). Present if the work supports part of speech tagging.
    • wordClass (0..1). Present if the work supports word class tagging.
    • spelling (0..1). Present if the work supports spelling tagging.
    • speaker (0..1). Present if the work supports speaker tagging.
    • gender (0..1). Present if the work supports speaker gender tagging.
    • mortality (0..1). Present if the work supports speaker mortality tagging.
    • prosodic (0..1). Present if the work supports prosodic tagging.
    • metricalShape (0..1). Present if the work supports metrical shape tagging.
    • pubDates (0..1). Present if the work supports publication date tagging.

    All of these children themselves have no children. They are typically specified as empty elements. E.g., <lemma/>.
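
    Putting the wordHoardHeader, pubDate, and taggingData descriptions above together, a work header could be sketched as follows. The date range and the prosodic default are hypothetical values chosen only to show the syntax, and the set of tagging categories is illustrative rather than copied from a real work file.

    ```xml
    <wordHoardHeader corpus="sha" work="ham" prosodic="verse">
        <!-- Hypothetical date range in "year-year" form. -->
        <pubDate>1600-1601</pubDate>
        <!-- One empty element per tagging category the work supports. -->
        <taggingData>
            <lemma/>
            <pos/>
            <wordClass/>
            <spelling/>
            <speaker/>
            <gender/>
            <mortality/>
            <prosodic/>
            <pubDates/>
        </taggingData>
    </wordHoardHeader>
    ```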

  • teiHeader. TEI header.

    Children:

    • fileDesc (1). File description.
  • fileDesc. File description.

    Children:

    • titleStmt (1). Title statement.
    • publicationStmt (0..1). Publication statement.
  • titleStmt. Title statement.

    Children:

    • title (1). The work's full title. Rendered at the top of the title page, centered, bold, and in a large font size.
    • shortTitle (0..1). The work's short title. If a short title is not specified, the short title is set to be the same as the full title. Short titles are used in concordances and other contexts.
    • author (1..n). The work's author(s). Rendered one per line on the title page following the title, centered, bold, in a normal font size, with a blank line separating the title and author(s). Each author name must match an author name in the author XML file.
    • respStmt (0..n). Responsibility statements.
  • titleStmt/title. Full work title.

    Children:

    • TEXT (1). The full work title. If this title is longer than 50 characters, it is truncated to 50 characters.
  • titleStmt/shortTitle. Short work title. If this title is longer than 50 characters, it is truncated to 50 characters.

    Children:

    • TEXT (1). The short work title.
  • titleStmt/author. An author.

    Children:

    • TEXT (1). The author's name.
  • respStmt. Responsibility statement.

    Children:

    • name (1). Name(s) of people. E.g., "Craig A. Berry".
    • resp (1). Responsibility. E.g., "editor".

    Responsibility statements are rendered on the title page after the title and authors, centered and in a small font size, with a blank line above and below.

  • respStmt/name. Name(s) of people.

    Children:

    • TEXT (1). The names.
  • respStmt/resp. Responsibility.

    Children:

    • TEXT (1). Responsibility.
  • publicationStmt. Publication statement.

    Children:

    • p (0..n). Untagged lines and paragraphs of styled text.

    Publication statements are rendered on the title page after the title, author, and responsibility statements, centered and in a small font size, with a blank line above and below each paragraph.
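
    A minimal sketch of a complete teiHeader built from the elements above. The title, short title, author, and publication text are placeholders; in a real file the author name must match an entry in the author XML file. The name and resp values reuse the examples given under respStmt.

    ```xml
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <!-- Titles longer than 50 characters are truncated. -->
                <title>An Example Work in Five Acts</title>
                <shortTitle>Example Work</shortTitle>
                <author>Example Author</author>
                <respStmt>
                    <name>Craig A. Berry</name>
                    <resp>editor</resp>
                </respStmt>
            </titleStmt>
            <publicationStmt>
                <p>Placeholder publication statement.</p>
            </publicationStmt>
        </fileDesc>
    </teiHeader>
    ```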

  • text. The work parts and their text.

    Children:

    • front (0..1). Front matter work parts.
    • body (0..1). Body work parts.
  • front. Front matter work parts.

    Children:

    • div (0..n). Work parts.
  • body. Body work parts.

    Children:

    • div (0..n). Work parts.
  • div. A work part.

    Each div element defines one node of the work part tree. The work part tree is defined in order as follows:

    • The title page.
    • The div descendants of WordHoardText/text/front
    • The div descendants of WordHoardText/text/body

    Attributes:

    • id. Required. The id of the work part. This id is combined with the corpus id and the work id to form the unique work part reference tag. The work part tag is corpusId-workId-workPartId. E.g., sha-ham-1-2 for Act 1, Scene 2 of Hamlet.
    • type="castList". Optional. Front matter div children with type="castList" are treated specially. Any castItem descendants of such div elements are used to define speakers and their tagging attributes (gender and mortality).
    • numberingStyle="stanza". Optional. If present, stanza numbering is used for the work part. The default is line numbering.
    • indent = Optional left margin indentation in pixels. WordHoard's default left margin for text is quite close to the left edge of work windows. This works well for scenes in Shakespeare plays, where speeches are rendered with the speaker names flush with the left margin and the bodies of the speeches indented. For other work parts where all or most of the text is flush with the left margin, a larger margin is more attractive. We use a value of indent="20" for this kind of text.
    • rend="none". Optional. If type="castList" and rend="none", the div element is used to define speakers, but it is not used to create a work part or generate a cast of characters page. We use this in the Early Greek Epic corpus to "invisibly" define speakers.

    Children:

    • wordHoardHeader (1). WordHoard header for the work part.
    • lg (0..n). Line groups.
    • sp (0..n). Speeches.
    • wordHoardTaggedLine (0..n). Tagged lines and paragraphs.
    • p (0..n). Untagged lines and paragraphs.
    • head (0..n). Headings.
    • stage (0..n). Stage directions.
    • castList (0..n). Cast lists.
    • div (0..n). Child work parts.
  • div/wordHoardHeader. WordHoard header for a work part.

    Attributes:

    • prosodic="prose" or "verse". Optional. Establishes a default prosodic attribute for the tagged words in the work part.

    Children:

    • title (1). Work part short title. This title is used in WordHoard's table of contents window and in other contexts. E.g., "Scene 3".
    • fullTitle (0..1). Work part full title. If missing, the full title is constructed as a comma-separated list of the titles of the ancestor parts and this part's title. E.g., "Act 2, Scene 3". WordHoard uses full part titles in work window popup menus and other contexts.
    • pathTag (0..1). Work part path tag. Paths composed of path tags are used in concordance windows to identify locations. The path to a work part is constructed by concatenating the work id with the path tags of all the ancestor parts and the work part, separated by periods. For example, the work id for Hamlet is ham, the path tag for Act 1 is 1, and the path tag for Scene 2 is 2, so the path to Act 1, Scene 2 of Hamlet is ham.1.2. If a match is found in line 39 of this scene, it is identified in the concordance window with the path ham.1.2.39. If the path tag is not specified, it is omitted when constructing paths.
    • taggingData (1). Enumerates the tagging data categories supported by the work part.
  • div/wordHoardHeader/title. Work part title.

    Children:

    • TEXT (1). The title of the work part. If this title is longer than 50 characters, it is truncated to 50 characters.
  • div/wordHoardHeader/fullTitle. Work part full title.

    Children:

    • TEXT (1). The full title of the work part. If this title is longer than 50 characters, it is truncated to 50 characters.
  • div/wordHoardHeader/pathTag. Work part path tag.

    Children:

    • TEXT (1). The path tag for the work part.
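
    The div and work part header rules above can be sketched with a nested act and scene. This assumes that the scene's id attribute is "1-2", so that it combines with the corpus and work ids to give sha-ham-1-2; the tagging categories listed are illustrative and the real ham.xml may differ in detail.

    ```xml
    <div id="1">
        <wordHoardHeader>
            <title>Act 1</title>
            <pathTag>1</pathTag>
            <taggingData/>
        </wordHoardHeader>
        <div id="1-2">
            <wordHoardHeader prosodic="verse">
                <title>Scene 2</title>
                <!-- fullTitle omitted: it is constructed as "Act 1, Scene 2". -->
                <pathTag>2</pathTag>
                <taggingData>
                    <lemma/>
                    <pos/>
                    <speaker/>
                </taggingData>
            </wordHoardHeader>
            <!-- sp, stage, and other children of the scene go here -->
        </div>
    </div>
    ```

    With the work id ham, the path tags 1 and 2 give the concordance path ham.1.2 for this scene.
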
  • lg. A line group.

    Attributes:

    • type="stanza". Optional. If specified, the line group is rendered with a blank line preceding and following the group.
    • n = optional stanza number. This attribute should be specified if type="stanza" is specified.
    • rend="spenser-indentation". Optional. If present, and if type="stanza" is also specified, the line group is rendered in the Spenser style: All but the first and last lines of the group are indented.

    Children:

    • wordHoardTaggedLine (0..n). Tagged lines or paragraphs of text.
    • p (0..n). Untagged lines or paragraphs of text.
    • head (0..n). Headings.
    • stage (0..n). Stage directions.
    • lg (0..n). Nested line groups.
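
    A small sketch of a stanza in a work part that uses stanza numbering. The div id, line ids, titles, and line text are all hypothetical placeholders, and untagged p lines are used only to keep the example short.

    ```xml
    <div id="1-1" numberingStyle="stanza" indent="20">
        <wordHoardHeader>
            <title>Canto 1</title>
            <taggingData/>
        </wordHoardHeader>
        <lg type="stanza" n="3" rend="spenser-indentation">
            <!-- No label attribute, so the label defaults to 3.1. -->
            <p id="example-3-1" n="1">First line of the stanza.</p>
            <!-- Middle lines are indented in the Spenser style. -->
            <p id="example-3-2" n="2">Second line of the stanza.</p>
            <p id="example-3-3" n="3">Last line of the stanza.</p>
        </lg>
    </div>
    ```
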
  • sp. A speech.

    Attributes:

    • who. Required. The id(s) of the speaker(s). For multiple speakers, the speaker ids are separated by spaces. These id(s) must match the id(s) defined in the role elements in the cast list.
    • rend="none" or rend="indent". Optional. If missing, the speech is rendered as in Shakespeare: a blank line, the speaker name(s) left-justified, a blank line, then the lines of the speech indented. If rend="none" is specified, the speech is rendered "invisibly" as in the Early Greek Epic corpus: no blank lines or speaker names, and the lines of the speech left-justified. If rend="indent" is specified, no speaker name is rendered, but the lines of the speech are indented (used for a few speeches in Shakespeare which have no rendered speaker names).

    Children:

    • speaker (0..1). If there is no rend attribute, this child is required. If there is a rend attribute, this child must not be present.
    • wordHoardTaggedLine (0..n). Tagged lines or paragraphs of text.
    • p (0..n). Untagged lines or paragraphs of text.
    • head (0..n). Headings.
    • stage (0..n). Stage directions.
    • lg (0..n). Line groups.
  • speaker. Speaker name(s).

    Children:

    • TEXT (1). The speaker name(s) for rendering.
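
    The three rendering modes for speeches might be written as in the fragment below. The speaker ids and line text are hypothetical; each id would have to match a role id in the cast list. Real speeches would normally contain wordHoardTaggedLine children, and p is used here only for brevity.

    ```xml
    <!-- Default rendering: speaker name shown, lines indented. -->
    <sp who="bernardo">
        <speaker>Bernardo</speaker>
        <p>Placeholder line of the speech.</p>
    </sp>

    <!-- Two speakers: ids separated by a space. -->
    <sp who="marcellus bernardo">
        <speaker>Marcellus and Bernardo</speaker>
        <p>Placeholder line of the speech.</p>
    </sp>

    <!-- "Invisible" speech: no speaker child is permitted. -->
    <sp who="achilles" rend="none">
        <p>Placeholder line of the speech.</p>
    </sp>

    <!-- Indented speech with no rendered speaker name. -->
    <sp who="hamlet" rend="indent">
        <p>Placeholder line of the speech.</p>
    </sp>
    ```
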
  • wordHoardTaggedLine. A line or wrapped paragraph of tagged styled text. With tagged text, all of the text is specified by child elements. Child text nodes are ignored.

    Attributes:

    • id. Optional. An id for the line. Required if the line has a line number.
    • n. Optional. The line number. With line numbering, this is an integer used to determine whether the line label is displayed in work panels with the "number every fifth line" option. The line label is displayed if and only if the line number is divisible by 5. With stanza numbering, this is the number of the line within the stanza.
    • label. Optional. The line label. Specifies the label displayed in the right margin for line numbering. If it is missing, the following rules are used:
      • With line numbering, the n attribute is used.
      • With stanza numbering, if the line is not inside a stanza, the n attribute is used.
      • With stanza numbering, if the line is inside a stanza, s.n is used, where s is the stanza number and n is the line number.
    • prosodic="prose" or "verse". Optional. Establishes a default prosodic attribute for the tagged words in the line.
    • rend="style". Optional. The default style is normal.
    • align="alignment". Optional. The default alignment is left.
    • indent = Optional additional left margin indentation in pixels.

    Children:

    • w (0..n). Tagged words.
    • punc (0..n). Punctuation and spacing.
    • hi (0..n). Change style.
    • title (0..n). Change style to italics.
    • stage (0..n). Stage directions.

    As an example, here's how the first line of Romeo and Juliet is tagged in the roj.xml work definition file:

    
    <wordHoardTaggedLine id="sha-roj100001" n="1">
       <w id="sha-roj10000101"
          lemma="two (nu)"
          pos="crd"                            >Two</w>
       <punc                                   > </punc>
       <w id="sha-roj10000102"
          lemma="household (n)"
          pos="n2"                             >households</w>
       <punc                                   >, </punc>
       <w id="sha-roj10000103"
          lemma="both (da)"
          pos="av-da"                          >both</w>
       <punc                                   > </punc>
       <w id="sha-roj10000104"
          lemma="alike (av)"
          pos="av"                             >alike</w>
       <punc                                   > </punc>
       <w id="sha-roj10000105"
          lemma="in (acp)"
          pos="p-acp"                          >in</w>
       <punc                                   > </punc>
       <w id="sha-roj10000106"
          lemma="dignity (n)"
          pos="n1"                             >dignity</w>
       <punc                                   >,</punc>
    </wordHoardTaggedLine>
    

    Note how the w children (tagged words) alternate with the punc children (untagged punctuation). Also note how the XML has been formatted so that you can read the text vertically on the right: "Two households, both alike in dignity,".

  • p. A line or wrapped paragraph of untagged styled text. With untagged text, text is specified by child text nodes and styles are specified by child hi and title elements.

    Attributes:

    • id. Optional. An id for the line.
    • n. Optional. The line number. This attribute works the same way as in tagged lines.
    • label. Optional. The line label. This attribute works the same way as in tagged lines.
    • rend="style". Optional. The default style is normal.
    • align="alignment". Optional. The default alignment is left.
    • indent = Optional additional left margin indentation in pixels.

    Children:

    • TEXT (0..n). Runs of text with the same style.
    • hi (0..n). Change style.
    • title (0..n). Change style to italics.

    An empty p element can be used to render a blank line: <p/>.
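
    As an illustration of mixed content in untagged text, the following fragment combines text runs with hi and title children, followed by a blank line. The wording and the indent value are placeholders.

    ```xml
    <p indent="20">See <title>The Faerie Queene</title>, <hi rend="bold">Book 1</hi>.</p>
    <p/>
    ```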

  • head. A heading.

    Headings are rendered with a blank line above and below, and by default they are rendered centered and in boldface. Otherwise this element is identical to the p element, with the same attributes and children.

  • w. A tagged word.

    Attributes:

    • id. Required. The unique id for the word, or "untagged" if the word is not tagged.
    • lemma. Optional. The word's lemma(s). See below.
    • pos. Optional. The word's part(s) of speech. See below.
    • prosodic="prose" or "verse". Optional. The prosodic attribute for the word.
    • metricalShape. Optional. The metrical shape of the word.
    • bensonGloss. Optional, used only for Chaucer. The id of the Benson gloss for the word. This id must match a lemPos id defined in the Benson glosses XML file.

    Children:

    • TEXT (1). The text of the word.

    For untagged words with id="untagged" the other attributes are ignored and are normally not specified. The word is included in the display of the text, but is not tagged with any data.

    The lemma and pos attributes specify the word's morphological tagging data.

    The lemma must be in one of the following formats:

    • spelling (wc)
    • spelling (wc) (hom)

    Where:

    • spelling = The spelling of the lemma.
    • wc = The lemma's word class. This id must match the id of a word class defined in the word class XML file.
    • hom = An optional homonym number.

    Examples:

    love (n)
    lie (v) (1)
    lie (v) (2)
    

    The part of speech must match the id of a part of speech defined in the part of speech XML file.

    For compound words, the lemma and pos attributes list multiple lemmas and parts of speech separated by the vertical bar character (|). For example, the first word of Hamlet is the contraction "Who's". This word is tagged as follows:

    
    <w id="sha-ham10100101"
       lemma="who (crq)|be (va)"
       pos="q-crq|vaz"                      >Who's</w>
    

    The first part of this word is the interrogative pronoun "who." The second part is the primary verb "be," used in the third person singular, present tense.
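
    For comparison, here is a sketch of a line that mixes a tagged word with an untagged one, as described above for id="untagged". The line id is hypothetical, the tagged word reuses the values from the Romeo and Juliet example, and the second word is left untagged purely for illustration.

    ```xml
    <wordHoardTaggedLine id="example-line-1" n="1">
       <!-- Tagged word with lemma and part of speech. -->
       <w id="sha-roj10000101" lemma="two (nu)" pos="crd">Two</w>
       <punc> </punc>
       <!-- Untagged word: displayed, but carries no tagging data. -->
       <w id="untagged">households</w>
    </wordHoardTaggedLine>
    ```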

  • punc. Punctuation and spacing.

    Children:

    • TEXT (1). The punctuation, including space characters.
  • hi. Changes style. These elements may be nested to any depth.

    Attributes:

    • rend="style". Required.

    Children (inside wordHoardTaggedLine elements):

    • w (0..n). Tagged words.
    • punc (0..n). Punctuation and spacing.
    • hi (0..n). Change style.
    • title (0..n). Change style to italics.
    • stage (0..n). Stage directions.

    Children (inside p elements):

    • TEXT (0..n). Runs of text with the same style.
    • hi (0..n). Change style.
    • title (0..n). Change style to italics.
  • title. Changes to italic style. This element is equivalent to <hi rend="italic">.
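
    Inside a tagged line, hi and title wrap w and punc children rather than raw text. A minimal sketch, with a hypothetical line id and untagged words used to keep it short:

    ```xml
    <wordHoardTaggedLine id="example-line-2" n="2">
       <hi rend="bold"><w id="untagged">Two</w></hi>
       <punc> </punc>
       <!-- title is equivalent to hi with rend="italic". -->
       <title><w id="untagged">households</w></title>
       <punc> </punc>
       <hi rend="italic"><w id="untagged">both</w></hi>
    </wordHoardTaggedLine>
    ```
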
  • stage. A stage direction.

    Children:

    • TEXT (1). The stage direction.

    Stage directions are rendered in italics, centered, with a blank line above and below.

  • castList. A cast list.

    Children:

    • castItem (0..n). Cast items.
    • castGroup (0..n). Cast groups.
  • castItem. A cast item.

    Attributes:

    • type="role" or "list". Required.
    • rend="none". Do not render this cast item on the cast of characters page. Optional. This option is useful if you need to define tagging data for a speaker, but do not want the speaker to appear on the cast of characters page.

    Children:

    • role (0..n). Names of the characters. If the type of the castItem element is role, only one role child is permitted. If the type is list, multiple role children are permitted.
    • roleDesc (0..1). Description of the character's role.

    Cast items are rendered in one of several ways:

    • If both the name and the description are missing, nothing is rendered.
    • If both the name and the description are present, the name of the character is rendered in plain text, followed by a comma and a space, followed by the description in italics.
    • If the name is present but the description is missing, the name of the character is rendered in plain text.
    • If the name is missing but the description is present, the description is rendered in plain text.
    • If the type is list and there is more than one role child, only the description is rendered, in plain text.
  • role. A character.

    Attributes:

    • id. Optional. The id of the character. This id is referenced by who attributes of speech elements. The id is required if it is referenced in any speech, otherwise it is optional.
    • gender="male", "female", or "uncertainMixedOrUnknown". Required if the work supports speaker gender tagging and this speaker is referenced by a speech.
    • mortality="mortal", "immortalOrSupernatural", or "unknownOrOther". Required if the work supports speaker mortality tagging and this speaker is referenced by a speech.
    • originalName = Optional name of the character in the original language of the work. We use this attribute in the Early Greek Epic corpus, where the TEXT child is the character's name in English, and the originalName attribute is the character's name in Greek.

    Children:

    • TEXT (1). Required. The name of the character for rendering.
  • roleDesc. Role description.

    Children:

    • TEXT (1). The role description.
  • castGroup. A cast group.

    Attributes:

    • rend="none". Do not render this cast group on the cast of characters page. Optional. This option is useful if you need to define tagging data for a group of speakers, but do not want them to appear on the cast of characters page.

    Children:

    • title (1). The title of the cast group.
    • castItem (1..n). The members of the cast group.

    The title is rendered left-justified, with the members of the cast group rendered following the title and indented, one per line.

  • castGroup/title. A cast group title.

    Children:

    • TEXT (1). The cast group title.
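
The cast list elements above can be combined as in the following sketch of a front matter work part. All ids, names, descriptions, and the group title are hypothetical, and the gender and mortality values are chosen only to show the permitted syntax.

```xml
<div id="cast" type="castList">
    <wordHoardHeader>
        <title>Cast of Characters</title>
        <taggingData/>
    </wordHoardHeader>
    <castList>
        <!-- Name and description: rendered as "Hamlet, " plus the description in italics. -->
        <castItem type="role">
            <role id="hamlet" gender="male" mortality="mortal">Hamlet</role>
            <roleDesc>Prince of Denmark</roleDesc>
        </castItem>
        <!-- Defines a speaker without showing it on the cast page. -->
        <castItem type="role" rend="none">
            <role id="ghost" gender="male" mortality="immortalOrSupernatural">Ghost</role>
        </castItem>
        <!-- Group title left-justified, members indented one per line. -->
        <castGroup>
            <title>Officers</title>
            <castItem type="role">
                <role id="bernardo" gender="male" mortality="mortal">Bernardo</role>
            </castItem>
            <castItem type="role">
                <role id="marcellus" gender="male" mortality="mortal">Marcellus</role>
            </castItem>
        </castGroup>
        <!-- type="list" with several roles: only the description is rendered. -->
        <castItem type="list">
            <role>Lords</role>
            <role>Ladies</role>
            <roleDesc>Lords, Ladies, and Attendants</roleDesc>
        </castItem>
    </castList>
</div>
```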
