MADA + TOKAN

This page walks through each stage of MADA+TOKAN processing and shows the output that each stage produces.

Example Input

Below is the (simple) original input file (containing just one sentence for this example). It is encoded in UTF-8, and has as the first word a sentence ID. MADA requires that only one sentence appear on each line of its input file.

   SAMPLE_ID:1    .اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب
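
The input convention described above (one sentence per line, sentence ID as the first word) can be sketched in Python as follows. This is only an illustration and not part of MADA; the function names split_sentence_id and read_input are hypothetical.

```python
def split_sentence_id(line):
    """Split one input line into (sentence_id, words).

    Assumption: the sentence ID is the first whitespace-delimited
    token on the line, as in the sample above.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line: expected a sentence ID followed by a sentence")
    return tokens[0], tokens[1:]


def read_input(lines):
    """Yield (sentence_id, words) for every non-blank line."""
    for line in lines:
        if line.strip():
            yield split_sentence_id(line)
```

Because each line is treated as exactly one sentence, a file with a sentence wrapped over two lines would be read as two sentences, which is why the one-sentence-per-line requirement matters.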

MADA and TOKAN can be run together with a single command. In the following examples, the outputs were generated using the command:

       perl MADA+TOKAN.pl config=myconfig.madaconfig file=Test 

where myconfig.madaconfig is a MADA configuration file that has been adjusted to produce the desired output, and Test is a file that contains the above Arabic sentence in UTF-8 encoding.

Preprocessing Results

Below are the results of running the input sentence through MADA's preprocessor. Here, SENTENCE_IDS has been set to YES (to prevent the processing of the "SAMPLE_ID" sentence IDs as Arabic words) and SEPARATEPUNCT has been set to YES (to place whitespace around the period at the end of the sentence). The preprocessor converts the input UTF-8 encoded text to Buckwalter encoding. Had there been any non-Arabic UTF-8 or non-printable characters present, they would have been deleted. In addition, the preprocessor collapses certain UTF-8 characters to ASCII equivalents (for example, ؟ is changed to ?, and French quotes « and » are changed to "). This is done to reduce word sparsity. The preprocessor output is placed in a file named input_file_name.bw.

  SAMPLE_ID:1      Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA AvnAr Alywm Alxmys An
                   AsbAnyA lm twqf AlmsAEdp Alty tqdmhA llmgrb .
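
To make the conversion concrete, here is a minimal Python sketch of a Buckwalter transliteration step. It is not MADA's preprocessor: the mapping is the standard Buckwalter table, but the function name to_buckwalter, the small punctuation table, and the exact rule for which characters are dropped or separated are assumptions made for illustration. The function expects the sentence text without its sentence ID.

```python
import re

# Standard Buckwalter transliteration table (Arabic code point -> ASCII).
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062A": "t", "\u062B": "v", "\u062C": "j",
    "\u062D": "H", "\u062E": "x", "\u062F": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063A": "g", "\u0640": "_", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u0649": "Y",
    "\u064A": "y", "\u064B": "F", "\u064C": "N", "\u064D": "K",
    "\u064E": "a", "\u064F": "u", "\u0650": "i", "\u0651": "~",
    "\u0652": "o", "\u0670": "`", "\u0671": "{",
}

# The punctuation collapses named in the text: Arabic question mark and
# French quotes. MADA may collapse more characters; these are examples only.
PUNCT = {"\u061F": "?", "\u00AB": '"', "\u00BB": '"'}


def to_buckwalter(text, separate_punct=True):
    """Transliterate sentence text (without its sentence ID) to Buckwalter."""
    out = []
    for ch in text:
        if ch in BUCKWALTER:
            out.append(BUCKWALTER[ch])
        elif ch in PUNCT:
            out.append(PUNCT[ch])
        elif ch.isascii() and (ch.isprintable() or ch.isspace()):
            out.append(ch)
        # Assumption: any other character is deleted.
    result = "".join(out)
    if separate_punct:
        # Only characters that are not Buckwalter letters are separated.
        result = re.sub(r'([.?!"])', r" \1 ", result)
    return " ".join(result.split())
```

Applied to the first two words of the sample sentence, this sketch yields Akd r}ys, matching the preprocessor output shown above.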

Morphological Analysis Results

Below we show the results of the MADA morphological analysis step. These can be found in the generated input_file_name.bw.ma file, if REMOVE_MA_FILE is set to NO (otherwise, this file is deleted when MADA no longer needs it). For display purposes, line breaks have been inserted below; note that in the original file, each analysis appears on a single line. In addition, only the first two words of the input sentence are shown below, for space reasons.

;;; SENTENCE_ID SAMPLE_ID:1
;;; SENTENCE Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA AvnAr Alywm Alxmys An AsbAnyA lm twqf AlmsAEdp Alty tqdmhA llmgrb .
;;WORD Akd
diac:>akado lex:kAd-a_1 bw:>a/IV1S+kad/IV+o/IVSUFF_MOOD:J gloss:almost;hardly;no_sooner pos:verb prc3:0 prc2:0
     prc1:0 prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kad stemcat:IV_C_intr
diac:>akido lex:kAd-i_1 bw:>a/IV1S+kid/IV+o/IVSUFF_MOOD:J gloss:deceive;harm pos:verb prc3:0 prc2:0 prc1:0
     prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kid stemcat:IV_C
diac:>akodi lex:kadaY-i_1 bw:>a/IV1S+kodi/IV+(null)/IVSUFF_MOOD:J gloss:be_stingy;skimp pos:verb prc3:0 prc2:0
     prc1:0 prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kodi stemcat:IV-i
diac:>akud~a lex:kad~-u_1 bw:>a/IV1S+kud~/IV+a/IVSUFF_MOOD:J gloss:work_hard;exhaust pos:verb prc3:0 prc2:0 prc1:0
     prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kud~ stemcat:IV_Vd
diac:>akud~a lex:kad~-u_1 bw:>a/IV1S+kud~/IV+a/IVSUFF_MOOD:S gloss:work_hard;exhaust pos:verb prc3:0 prc2:0 prc1:0
     prc0:0 per:1 asp:i vox:a mod:s gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kud~ stemcat:IV_Vd
diac:>akud~u lex:kad~-u_1 bw:>a/IV1S+kud~/IV+u/IVSUFF_MOOD:I gloss:work_hard;exhaust pos:verb prc3:0 prc2:0 prc1:0
     prc0:0 per:1 asp:i vox:a mod:i gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kud~ stemcat:IV_Vd
diac:>ak~ada lex:>ak~ad_1 bw:+>ak~ad/PV+a/PVSUFF_SUBJ:3MS gloss:affirm;assure;confirm;guarantee pos:verb prc3:0
     prc2:0 prc1:0 prc0:0 per:3 asp:p vox:a mod:i gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:>ak~ad stemcat:PV
diac:>ukad~a lex:kad~aY_1 bw:>u/IV1S+kad~a/IV_PASS+(null)/IVSUFF_MOOD:J gloss:be_begged pos:verb prc3:0 prc2:0
     prc1:0 prc0:0 per:1 asp:i vox:p mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kad~a stemcat:IV-a_Pass_yu
diac:>ukad~i lex:kad~aY_1 bw:>u/IV1S+kad~i/IV+(null)/IVSUFF_MOOD:J gloss:beg pos:verb prc3:0 prc2:0 prc1:0 prc0:0
     per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kad~i stemcat:IV-i_yu
diac:>ukoda lex:>akodaY_1 bw:>u/IV1S+koda/IV_PASS+(null)/IVSUFF_MOOD:J gloss:be_given_little;be_skimped_on pos:verb prc3:0
     prc2:0 prc1:0 prc0:0 per:1 asp:i vox:p mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:koda stemcat:IV-a_Pass_yu
diac:>ukodi lex:>akodaY_1 bw:>u/IV1S+kodi/IV+(null)/IVSUFF_MOOD:J gloss:give_little;skimp pos:verb prc3:0 prc2:0
     prc1:0 prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kodi stemcat:IV-i_yu
--------------
;;WORD r}ys
diac:ra}iys lex:ra}iys_1 bw:+ra}iys/NOUN+ gloss:president;head;chairman pos:noun prc3:0 prc2:0 prc1:0 prc0:0
     per:na asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysK lex:ra}iys_1 bw:+ra}iys/NOUN+K/CASE_INDEF_GEN gloss:president;head;chairman pos:noun prc3:0 prc2:0
     prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:i cas:g enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysN lex:ra}iys_1 bw:+ra}iys/NOUN+N/CASE_INDEF_NOM gloss:president;head;chairman pos:noun prc3:0 prc2:0
     prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:i cas:n enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysa lex:ra}iys_1 bw:+ra}iys/NOUN+a/CASE_DEF_ACC gloss:president;head;chairman pos:noun prc3:0 prc2:0
     prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:a enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysi lex:ra}iys_1 bw:+ra}iys/NOUN+i/CASE_DEF_GEN gloss:president;head;chairman pos:noun prc3:0 prc2:0
     prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:g enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysu lex:ra}iys_1 bw:+ra}iys/NOUN+u/CASE_DEF_NOM gloss:president;head;chairman pos:noun prc3:0 prc2:0
     prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
--------------

The output of the morphological analyzer for each input sentence consists of:

  1. A ";;; SENTENCE_ID" comment, if the original had sentence IDs and these were turned on in the configuration file
  2. A ";;; SENTENCE" comment which lists the full original sentence
  3. Each word in the sentence (in order) and a list of possible analyses for that word. The analyses at this point are in alphabetical order, and are independent of context. Each analysis list ends with a dashed line delimiter
  4. The end of the sentence is marked by a "SENTENCE BREAK" marker

The analyses that are produced consist of a number of feature:value pairs, starting with the fully diacritized word form (diac). These pairs show the various morphological features (part-of-speech, gender, case, number, mood, etc.) for the analysis, as well as other information used by MADA internally.
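
Since each analysis is a flat list of feature:value pairs, it maps naturally onto a dictionary. The Python sketch below shows one way to read a single analysis line; the function name parse_analysis is hypothetical, and the sketch assumes (as holds for the sample above) that values contain no whitespace. Values may themselves contain colons, as in the bw field, so each pair is split at its first colon only.

```python
def parse_analysis(line):
    """Turn one analysis line into a dict mapping feature name to value."""
    features = {}
    for field in line.split():
        # Split at the FIRST colon only: values such as
        # bw:>a/IV1S+kad/IV+o/IVSUFF_MOOD:J contain colons themselves.
        name, sep, value = field.partition(":")
        if sep:
            features[name] = value
    return features


analysis = parse_analysis(
    "diac:>akado lex:kAd-a_1 bw:>a/IV1S+kad/IV+o/IVSUFF_MOOD:J "
    "gloss:almost;hardly;no_sooner pos:verb per:1 asp:i"
)
print(analysis["bw"])   # the full Buckwalter tag string, colons included
print(analysis["pos"])
```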

Analysis Selection and Final MADA Results

Below are the final results of MADA. These can be found in the generated input_file_name.bw.mada file. Here, PRINT_ANALYSES has been set to STARS, meaning that only the final, highest-scoring analysis for each word is output. Again, line breaks have been inserted below for display reasons.

;; MADA OUTPUT FILE  -- VERSION 3.1  --- File created on Fri Feb 18 16:23:17 EST 2011
;; This file was produced by the command line:
;;  perl /usr/local/nlp/Taggers/Arabic/MADA-3.1/MADA-selectMA.pl config=generic.madaconfig file=Test2.bw.ma REMOVE_MA_FILE=NO 
;;CLASSIFIERS CONSIDERED: (asp) (cas) (enc0) (gen) (mod) (num) (per) (pos) (prc0) (prc1) (prc2) (prc3) (stt) (vox)  
;;OPTIONS:  (Tie Breaking is arbitrary) (Print_Analyses is stars) 
;;Feature Weights:  asp = 0.0374588  cas = 1.11344  enc0 = 1.14823  featsetprob = 12.1301  gen = 1.02833
                    mod = 1.10554  ngramdiac = 0.626436  ngramlex = 1.00998  notbackoff = 2.49999  num = 1.31008
                    partenc0 = 0.856252  partprc0 = 1.17501  partprc1 = 1.1553  partprc2 = 1.14064  partprc3 = 1.07678
                    per = 1.08064  pos = 1.02607  prc0 = 2.27208  prc1 = 1.15673  prc2 = 2.03792  prc3 = 4.50757
                    spellmatch = 1.81239  stt = 1.16685  vox = 0.343867  
;;==========================================
;;; SENTENCE_ID SAMPLE_ID:1
;;; SENTENCE Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA AvnAr Alywm Alxmys An AsbAnyA lm twqf AlmsAEdp Alty tqdmhA llmgrb .
;;WORD Akd
;;SVM_PREDICTIONS: Akd asp:p cas:na enc0:0 gen:m mod:i num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.946921 diac:>ak~ada lex:>ak~ad_1 bw:+>ak~ad/PV+a/PVSUFF_SUBJ:3MS gloss:affirm;assure;confirm;guarantee
          pos:verb prc3:0 prc2:0 prc1:0 prc0:0 per:3 asp:p vox:a mod:i gen:m num:s stt:na cas:na enc0:0 rat:na
          source:spvar stem:>ak~ad stemcat:PV
--------------
;;WORD r}ys
;;SVM_PREDICTIONS: r}ys asp:na cas:n enc0:0 gen:m mod:na num:s per:na pos:noun prc0:0 prc1:0 prc2:0 prc3:0 stt:c vox:na
*1.003530 diac:ra}iysu lex:ra}iys_1 bw:+ra}iys/NOUN+u/CASE_DEF_NOM gloss:president;head;chairman pos:noun prc3:0
          prc2:0 prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
--------------
;;WORD AlHkwmp
;;SVM_PREDICTIONS: AlHkwmp asp:na cas:g enc0:0 gen:f mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*1.007589 diac:AlHukuwmapi lex:Hukuwmap_1 bw:Al/DET+Hukuwm/NOUN+ap/NSUFF_FEM_SG+i/CASE_DEF_GEN gloss:government;administration
          pos:noun prc3:0 prc2:0 prc1:0 prc0:Al_det per:na asp:na vox:na mod:na gen:f num:s stt:d cas:g enc0:0 rat:y
          source:lex stem:Hukuwm stemcat:Napdu
--------------
;;WORD AlAsbAnyp
;;SVM_PREDICTIONS: AlAsbAnyp asp:na cas:g enc0:0 gen:f mod:na num:s per:na pos:adj prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*0.950711 diac:Al<isobAniy~api lex:<isobAniy~_2 bw:Al/DET+<isobAniy~/ADJ+ap/NSUFF_FEM_SG+i/CASE_DEF_GEN
          gloss:Spanish;Spaniard pos:adj prc3:0 prc2:0 prc1:0 prc0:Al_det per:na asp:na vox:na mod:na gen:f num:s stt:d
          cas:g enc0:0 rat:y source:spvar stem:<isobAniy~ stemcat:Nall
--------------
;;WORD xwsyh
;;SVM_PREDICTIONS: xwsyh asp:na cas:u enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*1.022412 diac:xuwsiyh lex:xuwsiyh_1 bw:+xuwsiyh/NOUN_PROP+ gloss:Jose pos:noun_prop prc3:0 prc2:0 prc1:0 prc0:0 per:na
          asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:lex stem:xuwsiyh stemcat:Nprop
--------------
;;WORD mAryA
;;SVM_PREDICTIONS: mAryA asp:na cas:u enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*1.022412 diac:mAriyA lex:mAriyA_1 bw:+mAriyA/NOUN_PROP+ gloss:Maria pos:noun_prop prc3:0 prc2:0 prc1:0 prc0:0 per:na
          asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:lex stem:mAriyA stemcat:Nprop
--------------
;;WORD AvnAr
;;NO-ANALYSIS
;;SVM_PREDICTIONS: AvnAr asp:na cas:n enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
NO-ANALYSIS [AvnAr]
--------------
;;WORD Alywm
;;SVM_PREDICTIONS: Alywm asp:na cas:a enc0:0 gen:m mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*1.004898 diac:Alyawoma lex:yawom_1 bw:Al/DET+yawom/NOUN+a/CASE_DEF_ACC gloss:day pos:noun prc3:0 prc2:0 prc1:0 prc0:Al_det
          per:na asp:na vox:na mod:na gen:m num:s stt:d cas:a enc0:0 rat:y source:lex stem:yawom stemcat:Ndu
--------------
;;WORD Alxmys
;;SVM_PREDICTIONS: Alxmys asp:na cas:a enc0:0 gen:m mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*1.004914 diac:Alxamiysa lex:xamiys_2 bw:Al/DET+xamiys/NOUN+a/CASE_DEF_ACC gloss:Thursday pos:noun prc3:0 prc2:0 prc1:0
          prc0:Al_det per:na asp:na vox:na mod:na gen:m num:s stt:d cas:a enc0:0 rat:y source:lex stem:xamiys stemcat:N
--------------
;;WORD An
;;SVM_PREDICTIONS: An asp:na cas:na enc0:0 gen:na mod:na num:na per:na pos:conj_sub prc0:na prc1:0 prc2:0 prc3:0 stt:na vox:na
*0.947612 diac:>an~a lex:>an~a_1 bw:+>an~a/SUB_CONJ+ gloss:that pos:conj_sub prc3:0 prc2:0 prc1:0 prc0:na per:na
          asp:na vox:na mod:na gen:na num:na stt:na cas:na enc0:0 rat:na source:spvar stem:>an~a stemcat:FW-Wa
--------------
;;WORD AsbAnyA
;;SVM_PREDICTIONS: AsbAnyA asp:na cas:u enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*0.930441 diac:>asobAniyA lex:<isobAniyA_1 bw:+>asobAniyA/NOUN_PROP+ gloss:Spain pos:noun_prop prc3:0 prc2:0 prc1:0
          prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:spvar stem:>asobAniyA stemcat:N0
--------------
;;WORD lm
;;SVM_PREDICTIONS: lm asp:na cas:na enc0:0 gen:na mod:na num:na per:na pos:part_neg prc0:na prc1:0 prc2:0 prc3:0 stt:na vox:na
*1.002028 diac:lamo lex:lamo_1 bw:+lamo/NEG_PART+ gloss:did_not pos:part_neg prc3:0 prc2:0 prc1:0 prc0:na per:na asp:na vox:na
          mod:na gen:na num:na stt:na cas:na enc0:0 rat:na source:lex stem:lamo stemcat:FW-Wa
--------------
;;WORD twqf
;;SVM_PREDICTIONS: twqf asp:i cas:na enc0:0 gen:f mod:s num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.947557 diac:tuwaq~ifa lex:waq~af_1 bw:tu/IV3FS+waq~if/IV+a/IVSUFF_MOOD:S gloss:stop;detain pos:verb prc3:0 prc2:0 prc1:0
          prc0:0 per:3 asp:i vox:a mod:s gen:f num:s stt:na cas:na enc0:0 rat:na source:lex stem:waq~if stemcat:IV_yu
--------------
;;WORD AlmsAEdp
;;SVM_PREDICTIONS: AlmsAEdp asp:na cas:n enc0:0 gen:f mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*0.981206 diac:AlmusAEadapu lex:musAEadap_1 bw:Al/DET+musAEad/NOUN+ap/NSUFF_FEM_SG+u/CASE_DEF_NOM
          gloss:assistance;aid;support;help pos:noun prc3:0 prc2:0 prc1:0 prc0:Al_det per:na asp:na vox:na mod:na gen:f
          num:s stt:d cas:n enc0:0 rat:y source:lex stem:musAEad stemcat:NapAt
--------------
;;WORD Alty
;;SVM_PREDICTIONS: Alty asp:na cas:u enc0:0 gen:f mod:na num:s per:na pos:pron_rel prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*1.002288 diac:Al~atiy lex:Al~a*iy_1 bw:+Al~atiy/REL_PRON+ gloss:which;who;whom_[fem.sg.] pos:pron_rel prc3:0 prc2:0 prc1:0
          prc0:0 per:na asp:na vox:na mod:na gen:f num:s stt:i cas:u enc0:0 rat:y source:lex stem:Al~atiy stemcat:FW-Wa
--------------
;;WORD tqdmhA
;;SVM_PREDICTIONS: tqdmhA asp:i cas:na enc0:3fs_dobj gen:f mod:i num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.967279 diac:tuqad~imuhA lex:qad~am_1 bw:tu/IV3FS+qad~im/IV+u/IVSUFF_MOOD:I+hA/IVSUFF_DO:3FS gloss:offer;present;introduce
          pos:verb prc3:0 prc2:0 prc1:0 prc0:0 per:3 asp:i vox:a mod:i gen:f num:s stt:na cas:na enc0:3fs_dobj rat:na source:lex
          stem:qad~im stemcat:IV_yu
--------------
;;WORD llmgrb
;;SVM_PREDICTIONS: llmgrb asp:na cas:g enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:Al_det prc1:li_prep prc2:0 prc3:0 stt:d vox:na
*1.000083 diac:lilmagoribi lex:magorib_1 bw:li/PREP+Al/DET+magorib/NOUN_PROP+i/CASE_DEF_GEN
          gloss:Maghreb;Maghrib_(northwest_Africa) pos:noun_prop prc3:0 prc2:0 prc1:li_prep prc0:Al_det per:na asp:na
          vox:na mod:na gen:m num:s stt:d cas:g enc0:0 rat:y source:lex stem:magorib stemcat:N
--------------
;;WORD .
;;SVM_PREDICTIONS: . asp:na cas:na enc0:na gen:na mod:na num:na per:na pos:punc prc0:na prc1:na prc2:na prc3:na stt:na vox:na
*1.045299 diac:. lex:._0 bw:./PUNC gloss:. pos:punc prc3:na prc2:na prc1:na prc0:na per:na asp:na vox:na mod:na gen:na
          num:na stt:na cas:na enc0:na rat:na source:punc
--------------
SENTENCE BREAK
--------------

Following a commented file header (which lists when MADA was run, what command line was used to run it, what options were used, what classifiers were used, and what feature weights were active), the output of MADA for each input sentence consists of:

  1. A ";;; SENTENCE_ID" comment, if the original had sentence IDs and these were turned on in the configuration file
  2. A ";;; SENTENCE" comment which lists the full original sentence
  3. The collection of word-level information, with each word's entry consisting of:
    • A ";;WORD" comment listing the original word form
    • A ";;SVM_PREDICTIONS" line comment, which lists the predictions made by each of the SVM classifiers. Since these predictions were made independently, these predictions together may not combine into a legal analysis
    • The list of scored analyses for the word. The number of analyses that are output is controlled by the PRINT_ANALYSES configuration variable. The analyses are identical to the ones that appeared in the output of the morphological analyzer, except that each is preceded by a score prefix.
      • The score prefix consists of a single character marker (* marks the top-scoring analysis, ^ marks an analysis which was tied with the top-scoring analysis before tie breaking was applied, and _ is used to mark all other analyses) and a numerical score.
      • Note that the numerical score is NOT (currently) a valid probability, nor is it normalized. It is simply a relative measure of worth and should never be used in probabilistic studies.
  4. The end of the sentence is marked by a "SENTENCE BREAK" marker
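
The structure listed above can be read with a few lines of Python. The sketch below is an illustration, not a MADA utility: the names SCORED and parse_mada are hypothetical, and it assumes that each analysis occupies one physical line (as in the real file) and that scores are written as plain decimals. Scores are kept only as relative values, in line with the caution above.

```python
import re

# Marker (*, ^ or _), a decimal score, then the feature:value pairs.
SCORED = re.compile(r"^([*^_])(-?\d+(?:\.\d+)?)\s+(.*)$")


def parse_mada(lines):
    """Yield (word, analyses) for each ;;WORD block.

    analyses is a list of (marker, score, features); it is empty for
    NO-ANALYSIS words, which carry no scored analysis line.
    """
    word, analyses = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(";;WORD "):
            if word is not None:
                yield word, analyses
            word, analyses = line[len(";;WORD "):], []
        elif word is not None:
            match = SCORED.match(line)
            if match:
                marker, score, rest = match.groups()
                feats = dict(f.split(":", 1) for f in rest.split() if ":" in f)
                analyses.append((marker, float(score), feats))
    if word is not None:
        yield word, analyses
```

Comment lines, the dashed delimiters, and the SENTENCE BREAK marker do not match the score pattern and are skipped; a fuller reader would also use them to group words into sentences.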

TOKAN Results

Below are the results of running TOKAN on the MADA output, using a TOKAN_SCHEMES_FILE that specifies two separate tokenizations, each of which is written to a different file. Here, the TOKAN_SCHEMES_FILE uses as its schemes "SCHEME=ATB MARKNOANALYSIS" in the first instance and "SCHEME=D3 MARKNOANALYSIS" in the second. Since SENTENCE_IDS is set to YES, the TOKAN_SCHEME variable is automatically extended by adding "SENT_ID" to ensure that the sentence IDs are passed through to the TOKAN output. The "SCHEME=ATB" term is shorthand specifying that the Penn Arabic Treebank tokenization should be used (tokenize all clitics except for the definite article, normalize alefs and yaas, use + characters as clitic markers, and replace ( and ) characters with -LRB- and -RRB- respectively). "SCHEME=D3" is similar, except that it also dictates that the definite article be separated. The "MARKNOANALYSIS" term indicates that, if MADA has marked a word as NO-ANALYSIS, then that word should appear in the TOKAN output as the original word form surrounded by @@; for example: @@UNKNOWN_WORD@@. The NO-ANALYSIS word remains untokenized, as unknown words cannot be tokenized reliably.

The output of TOKAN is written to input_file_name.bw.mada.TOKAN_extension, where TOKAN_extension is specified in the TOKAN_SCHEMES_FILE.

SCHEME=ATB
SAMPLE_ID:1 Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA @@AvnAr@@ Alywm Alxmys An AsbAnyA lm twqf
            AlmsAEdp Alty tqdm +hA l+ Almgrb .

SCHEME=D3
SAMPLE_ID:1 Akd r}ys Al+ Hkwmp Al+ AsbAnyp xwsyh mAryA @@AvnAr@@ Al+ ywm Al+ xmys An AsbAnyA lm twqf
            Al+ msAEdp Alty tqdm +hA l+ Al+ mgrb .

As with the input, each sentence appears on one line (line breaks are inserted above for display purposes). The sentence contains an example of a NO-ANALYSIS word (@@AvnAr@@) and of both pro- and enclitics that have been tokenized and marked with + characters (l+ and +hA). It contains no ( or ) characters, so the -LRB-/-RRB- replacement is not illustrated here.
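
Downstream code often needs to tell these token types apart. The following Python sketch classifies tokens from a TOKAN output line using only the surface conventions described above (@@ around NO-ANALYSIS words, a trailing + on proclitics, a leading + on enclitics). The function name classify_token and the four labels are hypothetical.

```python
def classify_token(token):
    """Label a TOKAN output token by its surface markers."""
    if len(token) > 4 and token.startswith("@@") and token.endswith("@@"):
        return "no_analysis"
    if len(token) > 1 and token.endswith("+"):
        return "proclitic"   # e.g. l+ or Al+
    if len(token) > 1 and token.startswith("+"):
        return "enclitic"    # e.g. +hA
    return "word"


line = "SAMPLE_ID:1 mAryA @@AvnAr@@ Alty tqdm +hA l+ Al+ mgrb ."
sentence_id, *tokens = line.split()
for token in tokens:
    print(token, classify_token(token))
```

A bare + token is treated as an ordinary word here, since a clitic marker must be attached to at least one character.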

Auxiliary Scripts

MADA comes with a number of auxiliary scripts which users may find useful. These scripts are typically run on the output of MADA (the .mada file) once MADA has finished. Here we show examples of two of these scripts run on the MADA output described above. Note that users can display the help for each of these scripts by running the script without arguments.

extractFeatureIntoSentenceFormat.pl

The script extractFeatureIntoSentenceFormat.pl reads a MADA file and writes to STDOUT each sentence it finds there, replacing each word with its MADA-predicted value of the feature specified with the feat=<value> term. In this way, users can quickly develop a sentence-formatted list of a particular morphological feature. This is very useful, for example, in developing an N-gram language model of that feature using SRI's Language Modeling Toolkit. Here are examples of the script's use and output:

   perl extractFeatureIntoSentenceFormat.pl file=Test.bw.mada feat=pos sentids

   SAMPLE_ID:1 verb noun noun adj noun_prop noun_prop noun noun noun conj_sub noun_prop part_neg verb
               noun pron_rel verb noun_prop punc


   perl extractFeatureIntoSentenceFormat.pl file=Test.bw.mada feat=diac sentids

   SAMPLE_ID:1 >ak~ada ra}iysu AlHukuwmapi Al<isobAniy~api xuwsiyh mAriyA AvnAr Alyawoma Alxamiysa
               >an~a >asobAniyA lamo tuwaq~ifa AlmusAEadapu Al~atiy tuqad~imuhA lilmagoribi .


   perl extractFeatureIntoSentenceFormat.pl file=Test.bw.mada feat=gen sentids includeword

   SAMPLE_ID:1 Akd:m r}ys:m AlHkwmp:f AlAsbAnyp:f xwsyh:m mAryA:m AvnAr:m Alywm:m Alxmys:m An:na AsbAnyA:m
               lm:na twqf:f AlmsAEdp:f Alty:f tqdmhA:f llmgrb:m .:na
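
To illustrate why the sentence format suits N-gram modeling, the Python sketch below counts N-grams over the feature values in such output. It is a toy stand-in, not a replacement for a real language modeling toolkit; the function name ngram_counts and the <s> and </s> padding symbols are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(sentences, n=2, has_sent_ids=True):
    """Count N-grams of feature values over sentence-formatted lines."""
    counts = Counter()
    for sentence in sentences:
        tags = sentence.split()
        if has_sent_ids:
            tags = tags[1:]  # drop the leading sentence ID
        # Pad so that sentence-initial and sentence-final tags are counted.
        padded = ["<s>"] * (n - 1) + tags + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


pos_line = ("SAMPLE_ID:1 verb noun noun adj noun_prop noun_prop noun noun noun "
            "conj_sub noun_prop part_neg verb noun pron_rel verb noun_prop punc")
bigrams = ngram_counts([pos_line])
print(bigrams[("noun", "noun")])
print(bigrams[("verb", "noun")])
```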


extractFeaturesIntoColumns.pl

This script reads a MADA file, and writes to STDOUT several user-specified columns for various features extracted from the MADA file. The first column is always the original word form, with the following tab-separated columns containing features specified in the command line argument. In this way, users can quickly extract any subset of morphological features without having to parse the MADA file. Sentence breaks are marked by a blank line in the output. Also, the first line contains column headers listing the features presented. Below is the output for diacritized form, part-of-speech and lexeme columns.

   perl extractFeaturesIntoColumns.pl file=Test.bw.mada feats=diac,pos,lex

   WORD	        diac	         pos	      lex
   Akd 	        >ak~ada	         verb	      >ak~ad_1
   r}ys	        ra}iysu	         noun	      ra}iys_1
   AlHkwmp	AlHukuwmapi	 noun	      Hukuwmap_1
   AlAsbAnyp	Al<isobAniy~api	 adj	      <isobAniy~_2
   xwsyh	xuwsiyh	         noun_prop    xuwsiyh_1
   mAryA	mAriyA	         noun_prop    mAriyA_1
   AvnAr	AvnAr	         noun	      AvnAr
   Alywm	Alyawoma	 noun	      yawom_1
   Alxmys	Alxamiysa	 noun	      xamiys_2  
   An	        >an~a	         conj_sub     >an~a_1
   AsbAnyA	>asobAniyA	 noun_prop    <isobAniyA_1
   lm	        lamo	         part_neg     lamo_1
   twqf	        tuwaq~ifa	 verb	      waq~af_1
   AlmsAEdp	AlmusAEadapu	 noun	      musAEadap_1
   Alty	        Al~atiy	         pron_rel     Al~a*iy_1
   tqdmhA	tuqad~imuhA	 verb	      qad~am_1
   llmgrb	lilmagoribi	 noun_prop    magorib_1
   .	        .	         punc	      ._0
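
The column output is straightforward to load into other tools. The Python sketch below reads it with the standard csv module, using the header line for field names and blank lines as sentence breaks. It assumes the actual output is strictly tab-separated (the display above is padded for readability); the function name read_feature_columns is hypothetical.

```python
import csv
import io


def read_feature_columns(stream):
    """Read column output into a list of sentences.

    Each sentence is a list of dicts keyed by the header names
    (WORD plus the requested features).
    """
    # QUOTE_NONE: quote characters in the data are ordinary punctuation.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = None
    sentences, current = [], []
    for row in reader:
        row = [cell.strip() for cell in row]
        if not any(row):            # blank line marks a sentence break
            if current:
                sentences.append(current)
                current = []
            continue
        if header is None:          # first non-blank line holds the headers
            header = row
            continue
        current.append(dict(zip(header, row)))
    if current:
        sentences.append(current)
    return sentences


data = ("WORD\tdiac\tpos\tlex\n"
        "Akd\t>ak~ada\tverb\t>ak~ad_1\n"
        "r}ys\tra}iysu\tnoun\tra}iys_1\n")
for entry in read_feature_columns(io.StringIO(data))[0]:
    print(entry["WORD"], entry["pos"], entry["lex"])
```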