< 

Introduction for Text Developers

< 

Table of Contents

< 

The Authors XML File


The Corpora XML File

The file corpora.xml defines all the corpora and their attributes.

The best way to learn the details of the corpora definition file is to study the standard NU file along with the formal specification given below. Open the standard NU corpora.xml file in your favorite text editor or XML editor. Position the window next to your browser window. As you read the descriptions for the individual elements below, look at how each element is used in the standard NU file.

The corpora definition file has the following elements:

  • WordHoardCorpora. The root element.

    Children:

    • corpus (0..n). Corpus definitions. The tabs with the corpus names at the top of the table of contents window appear in the same order as the corpus children.
  • corpus. Corpus definition.

    Attributes:

    • id. Required. The corpus id.
    • charset. Required. The character set used by the corpus: roman or greek.
    • posType. Required. The part of speech taxonomy used by the corpus: english or greek.

    Children:

    • title (1). Title.
    • taggingData (1). The tagging data categories supported by the corpus.
    • translations (0..1). Translations supported by the corpus, if any.
    • tconview (0..n). Table of contents views. If no views are defined, the default is a list of works sorted in alphabetical order by work id.
  • title. Title.

    Children:

    • TEXT (1). The corpus title.
  • taggingData. WordHoard tagging data categories.

    Children:

    • lemma (0..1). Present if the corpus supports lemma tagging.
    • pos (0..1). Present if the corpus supports part of speech tagging.
    • wordClass (0..1). Present if the corpus supports word class tagging.
    • spelling (0..1). Present if the corpus supports spelling tagging.
    • speaker (0..1). Present if the corpus supports speaker tagging.
    • gender (0..1). Present if the corpus supports speaker gender tagging.
    • mortality (0..1). Present if the corpus supports speaker mortality tagging.
    • prosodic (0..1). Present if the corpus supports prosodic tagging.
    • metricalShape (0..1). Present if the corpus supports metrical shape tagging.
    • pubDates (0..1). Present if the corpus supports publication date tagging.

    None of these children has children of its own. They are typically specified as empty elements, e.g., <lemma/>.
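
    As a rough sketch of how the categories above combine, a corpus that supports lemma, part of speech, speaker, and speaker gender tagging might declare the following. The particular selection of categories is an arbitrary assumption for illustration and is not taken from the standard NU file.

    ```xml
    <!-- Hypothetical selection: list only the categories the corpus supports. -->
    <taggingData>
      <lemma/>
      <pos/>
      <speaker/>
      <gender/>
    </taggingData>
    ```

    Categories that are left out (here wordClass, spelling, mortality, prosodic, metricalShape, and pubDates) are treated as unsupported by the corpus.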

  • translations. Translations supported by the corpus.

    Children:

    • translation (1..n). A supported translation.
    • description (0..1). A description of the translations.

    Translations are rendered in work windows in the same order they are listed in this element.
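
    A minimal sketch of a translations element, assuming a corpus with two translations. The ids "English" and "German" are the examples given for the translation element's id attribute; the description text is a placeholder, and writing translation as an empty element is an assumption based on it having no listed children.

    ```xml
    <!-- Work windows would render English first, then German. -->
    <translations>
      <translation id="English"/>
      <translation id="German"/>
      <!-- Placeholder text; the real wording is up to the corpus author. -->
      <description>Placeholder description of the available translations.</description>
    </translations>
    ```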

  • translation. A translation supported by the corpus.

    Attributes:

    • id. Required. The name (id) of the translation. E.g., "English" or "German".
  • description. A description of the translations. This description appears in the "Translations, Transcriptions, Etc." dialog.

    Children:

    • TEXT. The description.
  • tconview. A table of contents view. More than one view can be defined, in which case radio buttons for switching between the views appear at the top of the table of contents window. For example, the Shakespeare corpus has two views named "By Genre" and "By Date".

    Attributes:

    • type. Required. The type of the view. There are five view types:
      • byTag. The view is a list of all the works in the corpus sorted in increasing alphabetical order by work id. This is the default view if no views are specified in the corpora.xml file.
      • byDate. The view is a list of all the works in the corpus sorted in increasing numerical order by the first year of the publication year range. Views of this type display the publication year range in parentheses after each work title.
      • list. The view is a list of all the works in the corpus in the order specified by the work children.
      • category. The view is a list of categories in the order specified by the category children. Each category in turn contains a list of works. For example, the Shakespeare "By Genre" view is a category view.
      • byAuthor. The view is a list of authors in increasing alphabetical order. Each author contains a list of works by the author in increasing numerical order by the first year of the publication year range.
    • title. Required if there is more than one table of contents view. Optional and ignored if there is only one view. This attribute specifies the title of the radio button at the top of the table of contents window that is used to select this view.

    Children:

    • work. For views with type="list", the work children specify the works in the view, in order.
    • category. For views with type="category", the category children specify the categories in the view, in order.
  • work. A work for a table of contents view.

    Attributes:

    • id. Required. The id of the work. E.g., sha-ham for Hamlet.
  • category. A category for a table of contents view.

    Attributes:

    • title. Required. The category title as it should appear in the table of contents view.

    Children:

    • work. The works for the category, in order.
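
Putting the elements above together, a small corpora definition file might look like the sketch below. It is illustrative only and is not the standard NU file. The view titles "By Genre" and "By Date" and the work id sha-ham come from the descriptions above. The corpus id "sha", the title text, the chosen tagging categories, the category title "Tragedies", and the work id "sha-mac" are hypothetical values.

```xml
<!-- Illustrative sketch only; values marked hypothetical are assumptions. -->
<WordHoardCorpora>
  <!-- Hypothetical corpus id and title. -->
  <corpus id="sha" charset="roman" posType="english">
    <title>Shakespeare</title>
    <!-- Hypothetical choice of supported tagging categories. -->
    <taggingData>
      <lemma/>
      <pos/>
      <speaker/>
    </taggingData>
    <!-- Two views are defined, so each one needs a title for its radio button. -->
    <tconview type="category" title="By Genre">
      <!-- Hypothetical category; works are listed in display order. -->
      <category title="Tragedies">
        <work id="sha-ham"/>
        <work id="sha-mac"/>
      </category>
    </tconview>
    <!-- byDate views are sorted automatically, so no work children are listed. -->
    <tconview type="byDate" title="By Date"/>
  </corpus>
</WordHoardCorpora>
```

Because this corpus has no translations, the optional translations element is omitted. The corpus children appear in the same order as in the corpus element description: title, taggingData, then the tconview elements.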
