MADA + TOKAN

This page describes the changes made in the most recent major versions of MADA+TOKAN. The same information is presented in greater detail in the MADA.CHANGES file included with each MADA+TOKAN release.

Major Revision History

Version           Release Date    Description
3.2               January 2012    A few convenience features, plus several minor bug fixes
3.1               August 2010     Significant Model improvements
3.0               May 2010        Added preprocessor, ALMOR, other convenience features. SVMs reorganized.
2.3.2 (== 2.32)   July 2009       Numerous bug fixes
2.0               March 2008      Major code refactoring, weighted features, and other improvements

Version 3.2 Changes

Added the option to produce TOKAN output that is encoded as Buckwalter, Safe Buckwalter or UTF-8
Added the option to build MADA using Aramorph (a free version of BAMA 1.2.1) instead of SAMA 3.1
Added the option to declare an output directory into which all MADA and TOKAN output is written
Added a 'quiet' mode that will suppress MADA+TOKAN status messages
Added a GLOSS mode to TOKAN to output an English gloss as one of the TOKAN scheme forms
Altered the MADA output file format slightly; the ";;MADA" line (which displays the predictions of the SVM classifiers) has been renamed ";;SVM_PREDICTIONS" for clarity
Minor improvements to support scripts
Added a TOKAN-evaluate.pl script to compare TOKAN output for some simple cases.
Identified a bug in SRILM that can cause differences in MADA output after minor changes to the input; a patch for SRILM is provided to fix this.
Fixed the handling of blank lines in TOKAN.
Fixed a bug which caused @@LAT@@ words to only have a single output form when the TOKAN_SCHEME specified several forms
Various minor bug fixes in TOKAN
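
Scripts that read MADA output files written by both older releases and 3.2 need to accept either name for the SVM prediction line. The following is a minimal Python sketch of one way to do that, not part of the MADA+TOKAN distribution. The function name is hypothetical, and the assumption that the marker sits at the start of the line and may be followed by a colon or whitespace should be checked against real output files.

```python
# Marker names taken from the 3.2 change list: ";;MADA" was renamed
# to ";;SVM_PREDICTIONS". Everything else here is an assumption.
NEW_MARKER = ";;SVM_PREDICTIONS"
OLD_MARKER = ";;MADA"


def svm_prediction_payload(line):
    """Return the text after the SVM prediction marker, or None.

    Accepts both the pre-3.2 and the 3.2 marker. The marker must be
    followed by end of line, a colon, or whitespace, so that a longer
    (hypothetical) marker sharing the same prefix is not matched.
    """
    line = line.rstrip("\r\n")
    for marker in (NEW_MARKER, OLD_MARKER):
        if line.startswith(marker):
            rest = line[len(marker):]
            if rest == "" or rest[0] in ": \t":
                return rest.lstrip(": \t")
    return None
```

A caller would apply this function to each line of an output file and keep the results that are not None.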

Version 3.1 Changes

New models have been trained using roughly twice the training data
A flaw that rendered the SVM models in MADA 3.0 sub-optimal has been removed
Miscellaneous bug fixes

Version 3.0 Changes

Added a preprocessor to handle input text cleaning and formatting
Replaced Aragen morphological analyzer with its successor, Almorgeana (ALMOR)
Added an INSTALL.pl script to help with MADA installation
Refactored TOKAN; added the means to run multiple TOKAN schemes on the same file
Numerous changes to configuration variables for clarity and convenience
Reorganized N-gram models of lemmas and diacritics
Miscellaneous other bug fixes

Version 2.3.2 Changes

Miscellaneous bug fixes
Added morphological backoff options

Version 2.0 Changes

Refactored entire code base to make maintenance and improvements easier
Added tuned feature weights to improve analysis selection
Improved lemma and diacritic N-gram models
Added a few scripts to handle common tasks on MADA files, such as feature extraction
Added numerous convenience features, such as the ability to process gzipped files
Miscellaneous bug fixes