
Martin Mueller's Tagging Data

This chapter is for Northwestern University developers only.

Professor Martin Mueller uses a Microsoft Access database to maintain the morphological tagging data for WordHoard's four primary corpora (Shakespeare, Chaucer, Spenser, and Early Greek Epic). His database currently resides on the host ariadne.at.northwestern.edu at the following path:


E:\Users\shared\NUPOS\NUPOS.mdb

Only the Northwestern WordHoard developers have access to the Ariadne host.

When Martin changes his tagging data, we must rebuild the raw data XML files and the static object model from the new data. To do this, we export the tables we need from his database to plain text files, copy the files to our development machine, and import them into a MySQL database named martin. We then run a script that updates the XML definition files with the new tagging data, and finally do a full build of the primary wordhoard database.

Edit the properties file properties/martin.properties to set your MySQL root username and password.

Create a subdirectory of your WordHoard development directory named martin.

To create a new empty martin database, use the create-martin-database alias:


% create-martin-database

You only need to create the martin database once, unless the database structure changes (which it might).

Saved export operations in Access on Ariadne export the tables we need from the database. To get a new version of Martin's data, open the database in Access and run the following saved export operations:


NUPOS_WordClass -> NUPOS_WordClass.txt
NUPOS_EnglishGreek -> NUPOS_EnglishGreek.txt
NUPOSTrainingData -> NUPOSTrainingData.txt
NUPOS_GreekData -> NUPOS_GreekData.txt

The exported plain text files are written to the directory E:\Users\shared\Exports. Copy all of the files in this directory over the network into the martin directory on your development machine.
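Before importing, it can help to confirm that all four export files actually arrived in the martin directory. The following is a minimal sketch, not part of the WordHoard scripts: the function name missing_exports is hypothetical, and the file names are simply those listed in the export operations above.

```python
from pathlib import Path

# File names taken from the saved export operations listed above.
EXPECTED_EXPORTS = [
    "NUPOS_WordClass.txt",
    "NUPOS_EnglishGreek.txt",
    "NUPOSTrainingData.txt",
    "NUPOS_GreekData.txt",
]


def missing_exports(martin_dir):
    """Return the expected export files that are absent or empty in martin_dir.

    Hypothetical helper for illustration; it only checks file presence and
    size, not the contents of the files.
    """
    root = Path(martin_dir)
    missing = []
    for name in EXPECTED_EXPORTS:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling missing_exports("martin") from the WordHoard development directory returns an empty list when every file is present and non-empty.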

Note: The Access export operations must use the UTF-8 code page.
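One way to catch an export that used the wrong code page is to check that each copied file decodes as UTF-8 before importing it. This is a sketch under stated assumptions: the helper names first_invalid_utf8 and check_exports are hypothetical and not part of the WordHoard scripts. A file that passes is not proven to have been exported as UTF-8 (pure ASCII text is valid in many code pages), but a file that fails was certainly not.

```python
from pathlib import Path


def first_invalid_utf8(path):
    """Return the byte offset of the first invalid UTF-8 sequence in the
    file, or None if the whole file decodes cleanly."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_exports(martin_dir):
    """Map every .txt file in martin_dir to its first invalid byte offset.

    A value of None means the file is valid UTF-8. Hypothetical helper;
    it checks all .txt files in the directory, not only the four exports.
    """
    return {
        p.name: first_invalid_utf8(p)
        for p in sorted(Path(martin_dir).glob("*.txt"))
    }
```

Any file mapped to a number rather than None should be re-exported from Access with the correct code page.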

To import Martin's data into the MySQL martin database, use the import-all alias:


% import-all

To update the XML data files, use the martin-update alias:


% martin-update

This alias runs the martin-update script and redirects stdout to the report file martin/report.txt. Any error messages in the report start with "#####".
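Because error lines share the "#####" prefix, they are easy to pull out of the report. The sketch below is illustrative only: the function name report_errors is hypothetical, and it assumes the report is UTF-8 text (undecodable bytes are replaced rather than raising an error).

```python
from pathlib import Path

# Error lines in the report start with this marker.
ERROR_PREFIX = "#####"


def report_errors(report_path):
    """Return the lines of the report that start with the error marker.

    Assumes the report is UTF-8 text; bytes that do not decode are
    replaced so that a stray character cannot hide the remaining errors.
    """
    text = Path(report_path).read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if line.startswith(ERROR_PREFIX)]
```

For example, report_errors("martin/report.txt") returns an empty list when the update ran without reported errors.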

The martin-update script writes new versions of the two files word-classes.xml and pos.xml. It also reads the work XML files for the four NU corpora and updates the attributes in them for the morphological tagging data (lemma and part of speech).

Finally, to rebuild the static object model with Martin's new data, use the full-build alias:


% full-build
