Below are answers to frequently asked questions about MADA and TOKAN.
MADA+TOKAN is (currently) a Perl-based application that uses third-party tools for N-gram and SVM modeling. Therefore, any system that can run Perl and those tools should be able to run MADA+TOKAN without a problem. That being said, MADA+TOKAN was developed for use in Linux/Unix environments, and it has not been thoroughly tested on Windows machines or Macs, or in virtual environments like Cygwin.
MADA+TOKAN requires the following:
Details on how to install these tools for use by MADA are given in the MADA+TOKAN Manual. Note that SAMA 3.0, BAMA 2.0 or Aramorph 1.2.1 can be used in place of SAMA 3.1, but MADA has been optimized to work with SAMA 3.1.
MADA does not use the SAMA, BAMA or Aramorph software. MADA only makes use of the language prefix, stem and suffix tables developed for those tools -- not the actual morphological analyzer. During installation, those language tables are read and translated into a specialized database that Almorgeana uses when MADA runs. The process of creating this database refines the information to be more useful to the MADA task, corrects known errors, and creates a common format so that any of the SAMA, BAMA or Aramorph versions can be used. After installation, the SAMA/BAMA/Aramorph files are not needed again -- everything has been transferred to the Almorgeana database.
Unfortunately, we do not have permission to distribute any part of the SAMA or BAMA utilities. This means it is not possible for us to simply create the Almorgeana database and distribute it with the MADA release. Hence, users must acquire one of these tools and create the database on their system during installation.
Currently, the LDC is restricting SAMA 3.1 to members only. An older version, BAMA 2.0 (LDC2004L02), is available as well, but only to members.
As of MADA 3.2, another option is the freely available Aramorph 1.2.1, which is effectively BAMA 1.2.1. Since it is a much older version, it is not annotated to have the same level of information or consistency as SAMA 3.1. This means the accuracy of a MADA install using Aramorph or BAMA 2.0 will not be as good as a MADA install that uses SAMA 3.1. In our tests, we found that a MADA+TOKAN build using Aramorph was able to reproduce the same tokenizations (TOKAN output) produced by a build using SAMA for about 99.4% of the words tested (using "SCHEME=ATB" as the TOKAN_SCHEME).
If at all possible, we recommend using SAMA 3.1 for best results.
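For readers who want to run a similar comparison on their own data, the following is a minimal sketch of a word-level agreement measure between two builds. It is not the procedure used for the figure quoted above, which is not described here. The function name, the input format (one tokenization string per input word, in the same order for both builds) and the sample strings are all assumptions made for illustration.

```python
def tokenization_agreement(reference, candidate):
    """Fraction of words whose tokenization is identical in both lists.

    Both arguments are lists with one tokenization string per input word,
    in the same word order (an assumed format, not the TOKAN file format).
    """
    if len(reference) != len(candidate):
        raise ValueError("both builds must cover the same number of words")
    if not reference:
        raise ValueError("no words to compare")
    matches = sum(1 for ref, cand in zip(reference, candidate) if ref == cand)
    return matches / len(reference)


# Made-up placeholder tokenizations, not real MADA+TOKAN output.
sama_build = ["w+ ktb", "Al+ byt", "fy", "mdrsp"]
aramorph_build = ["w+ ktb", "Albyt", "fy", "mdrsp"]

agreement = tokenization_agreement(sama_build, aramorph_build)
print(f"{agreement:.1%} of words tokenized identically")
```

Extracting the per-word tokenizations from the actual output files is left out, since that depends on the configuration used.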
On our systems, we have benchmarked MADA+TOKAN 3.2 at between 70 and 90 words per second (including pre-processing). Speed is greatly affected, however, by the actual words used in the text (words with many possible analyses take longer to process), the number of sentences in the text (lots of long sentences take longer), the size of the input file and the exact configuration used; user times may vary significantly.
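As a rough planning aid, the benchmark range above can be turned into a run-time estimate. This is only a back-of-the-envelope sketch: the function name and the sample word count are assumptions, and the 70 and 90 words-per-second figures are simply the range quoted above, so actual times on other systems and configurations may fall well outside the result.

```python
def estimate_seconds(word_count, slow_wps=70.0, fast_wps=90.0):
    """Return (best_case, worst_case) run time in seconds.

    slow_wps and fast_wps default to the benchmark range quoted above.
    """
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    return word_count / fast_wps, word_count / slow_wps


# Hypothetical input size, chosen only for illustration.
best, worst = estimate_seconds(63000)
print(f"Roughly {best / 60:.1f} to {worst / 60:.1f} minutes")
```

For a 63,000-word file this gives between 700 and 900 seconds, assuming the throughput stays within the benchmarked range.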