This page walks through each stage of MADA+TOKAN operation and shows the output of each stage.
Below is the (simple) original input file (containing just one sentence for this example). It is encoded in UTF-8, and has as the first word a sentence ID. MADA requires that only one sentence appear on each line of its input file.
SAMPLE_ID:1 .اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب
MADA and TOKAN can be run concurrently with a single command. In the following examples, the outputs were generated by using the command:
perl MADA+TOKAN.pl config=myconfig.madaconfig file=Test
where myconfig.madaconfig is a MADA configuration file that has been adjusted to produce the desired output, and Test is a file that contains the above Arabic sentence in UTF-8 encoding.
Below are the results of running the input sentence through MADA's preprocessor. Here, SENTENCE_IDS has been set to YES (to prevent the processing of the "SAMPLE_ID" sentence IDs as Arabic words) and SEPARATEPUNCT has been set to YES (to place whitespace around the period at the end of the sentence). The preprocessor converts the input UTF-8 encoded text to Buckwalter encoding. Had there been any non-Arabic UTF-8 or non-printable characters present, they would have been deleted. In addition, the preprocessor collapses certain UTF-8 characters to ASCII equivalents (for example, ؟ is changed to ?, and French quotes « and » are changed to "). This is done to reduce word sparsity. The preprocessor output is placed in a file named input_file_name.bw.
SAMPLE_ID:1 Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA AvnAr Alywm Alxmys An
AsbAnyA lm twqf AlmsAEdp Alty tqdmhA llmgrb .
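The conversion described above can be approximated in a few lines of Python. This is only a sketch, not MADA's preprocessor: the function name preprocess, its two keyword arguments, and the punctuation set are assumptions made for illustration, and only the three character collapses named in the text are included. The Buckwalter table itself is the standard one for the Unicode ranges shown.

```python
import re

# Standard Buckwalter symbols for two contiguous Unicode ranges:
# U+0621..U+063A (hamza through ghain) and U+0640..U+0652 (tatweel through sukun).
_RANGES = (
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),
    (0x0640, "_fqklmnhwYyFNKaui~o"),
)
BUCKWALTER = {
    chr(start + offset): symbol
    for start, symbols in _RANGES
    for offset, symbol in enumerate(symbols)
}

# Only the collapses mentioned in the text; the real preprocessor handles more.
COLLAPSE = {"\u061f": "?", "\u00ab": '"', "\u00bb": '"'}

# Assumed punctuation set for the SEPARATEPUNCT-like behaviour.
PUNCT = re.compile(r'([.,;:!?"()])')


def preprocess(line, sentence_ids=True, separate_punct=True):
    """Transliterate one input line to Buckwalter (illustrative sketch)."""
    sent_id, text = "", line.strip()
    if sentence_ids:
        # The first whitespace-delimited token is passed through untouched.
        sent_id, _, text = text.partition(" ")
    out = []
    for ch in text:
        if ch in BUCKWALTER:
            out.append(BUCKWALTER[ch])
        elif ch in COLLAPSE:
            out.append(COLLAPSE[ch])
        elif " " <= ch <= "~":
            out.append(ch)  # printable ASCII is kept as-is
        # every other character is dropped
    result = "".join(out)
    if separate_punct:
        result = PUNCT.sub(r" \1 ", result)
    result = " ".join(result.split())
    return f"{sent_id} {result}".strip() if sentence_ids else result
```

Calling preprocess on a line that starts with a sentence ID leaves the ID untouched and returns the remainder in Buckwalter encoding with punctuation set off by spaces.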
Below we show the results of the MADA morphological analysis step. These can be found in the generated input_file_name.bw.ma file, if REMOVE_MA_FILE is set to NO (otherwise, this file is deleted when MADA no longer needs it). For display purposes, line breaks have been inserted below; note that in the original file, each analysis appears on a single line. In addition, only the first two words of the input file are shown below, for space reasons.
;;; SENTENCE_ID SAMPLE_ID:1
;;; SENTENCE Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA AvnAr Alywm Alxmys An AsbAnyA lm twqf AlmsAEdp Alty tqdmhA llmgrb .
;;WORD Akd
diac:>akado lex:kAd-a_1 bw:>a/IV1S+kad/IV+o/IVSUFF_MOOD:J gloss:almost;hardly;no_sooner pos:verb prc3:0 prc2:0
prc1:0 prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kad stemcat:IV_C_intr
diac:>akido lex:kAd-i_1 bw:>a/IV1S+kid/IV+o/IVSUFF_MOOD:J gloss:deceive;harm pos:verb prc3:0 prc2:0 prc1:0
prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kid stemcat:IV_C
diac:>akodi lex:kadaY-i_1 bw:>a/IV1S+kodi/IV+(null)/IVSUFF_MOOD:J gloss:be_stingy;skimp pos:verb prc3:0 prc2:0
prc1:0 prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kodi stemcat:IV-i
diac:>akud~a lex:kad~-u_1 bw:>a/IV1S+kud~/IV+a/IVSUFF_MOOD:J gloss:work_hard;exhaust pos:verb prc3:0 prc2:0 prc1:0
prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kud~ stemcat:IV_Vd
diac:>akud~a lex:kad~-u_1 bw:>a/IV1S+kud~/IV+a/IVSUFF_MOOD:S gloss:work_hard;exhaust pos:verb prc3:0 prc2:0 prc1:0
prc0:0 per:1 asp:i vox:a mod:s gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kud~ stemcat:IV_Vd
diac:>akud~u lex:kad~-u_1 bw:>a/IV1S+kud~/IV+u/IVSUFF_MOOD:I gloss:work_hard;exhaust pos:verb prc3:0 prc2:0 prc1:0
prc0:0 per:1 asp:i vox:a mod:i gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kud~ stemcat:IV_Vd
diac:>ak~ada lex:>ak~ad_1 bw:+>ak~ad/PV+a/PVSUFF_SUBJ:3MS gloss:affirm;assure;confirm;guarantee pos:verb prc3:0
prc2:0 prc1:0 prc0:0 per:3 asp:p vox:a mod:i gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:>ak~ad stemcat:PV
diac:>ukad~a lex:kad~aY_1 bw:>u/IV1S+kad~a/IV_PASS+(null)/IVSUFF_MOOD:J gloss:be_begged pos:verb prc3:0 prc2:0
prc1:0 prc0:0 per:1 asp:i vox:p mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kad~a stemcat:IV-a_Pass_yu
diac:>ukad~i lex:kad~aY_1 bw:>u/IV1S+kad~i/IV+(null)/IVSUFF_MOOD:J gloss:beg pos:verb prc3:0 prc2:0 prc1:0 prc0:0
per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kad~i stemcat:IV-i_yu
diac:>ukoda lex:>akodaY_1 bw:>u/IV1S+koda/IV_PASS+(null)/IVSUFF_MOOD:J gloss:be_given_little;be_skimped_on pos:verb prc3:0
prc2:0 prc1:0 prc0:0 per:1 asp:i vox:p mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:koda stemcat:IV-a_Pass_yu
diac:>ukodi lex:>akodaY_1 bw:>u/IV1S+kodi/IV+(null)/IVSUFF_MOOD:J gloss:give_little;skimp pos:verb prc3:0 prc2:0
prc1:0 prc0:0 per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na source:spvar stem:kodi stemcat:IV-i_yu
--------------
;;WORD r}ys
diac:ra}iys lex:ra}iys_1 bw:+ra}iys/NOUN+ gloss:president;head;chairman pos:noun prc3:0 prc2:0 prc1:0 prc0:0
per:na asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysK lex:ra}iys_1 bw:+ra}iys/NOUN+K/CASE_INDEF_GEN gloss:president;head;chairman pos:noun prc3:0 prc2:0
prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:i cas:g enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysN lex:ra}iys_1 bw:+ra}iys/NOUN+N/CASE_INDEF_NOM gloss:president;head;chairman pos:noun prc3:0 prc2:0
prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:i cas:n enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysa lex:ra}iys_1 bw:+ra}iys/NOUN+a/CASE_DEF_ACC gloss:president;head;chairman pos:noun prc3:0 prc2:0
prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:a enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysi lex:ra}iys_1 bw:+ra}iys/NOUN+i/CASE_DEF_GEN gloss:president;head;chairman pos:noun prc3:0 prc2:0
prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:g enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
diac:ra}iysu lex:ra}iys_1 bw:+ra}iys/NOUN+u/CASE_DEF_NOM gloss:president;head;chairman pos:noun prc3:0 prc2:0
prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
--------------
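The output of the morphological analyzer for each input sentence consists of a ;;; SENTENCE_ID line (when sentence IDs are in use), a ;;; SENTENCE line containing the Buckwalter-encoded sentence, and then, for each word, a ;;WORD line followed by every analysis proposed for that word (one per line) and a separator line of dashes.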
The analyses that are produced consist of a number of feature:value pairs, starting with the fully diacritized word form (diac). These pairs show the various morphological features (part-of-speech, gender, case, number, mood, etc.) for the analysis, as well as other information used by MADA internally.
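A single analysis line can be turned into a dictionary by splitting on whitespace and then on the first colon of each token. The sketch below assumes that feature values never contain whitespace, which holds for the sample shown; parse_analysis is a hypothetical helper name, not part of MADA.

```python
def parse_analysis(line):
    """Split one analysis line into a {feature: value} dictionary."""
    feats = {}
    for token in line.split():
        # Split on the FIRST colon only: values such as
        # "bw:>a/IV1S+kad/IV+o/IVSUFF_MOOD:J" contain colons themselves.
        key, sep, value = token.partition(":")
        if sep:
            feats[key] = value
    return feats


sample = (
    "diac:>akado lex:kAd-a_1 bw:>a/IV1S+kad/IV+o/IVSUFF_MOOD:J "
    "gloss:almost;hardly;no_sooner pos:verb prc3:0 prc2:0 prc1:0 prc0:0 "
    "per:1 asp:i vox:a mod:j gen:m num:s stt:na cas:na enc0:0 rat:na "
    "source:spvar stem:kad stemcat:IV_C_intr"
)
analysis = parse_analysis(sample)
```

Splitting on the first colon only matters for the bw feature, whose value carries its own colon-separated tags.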
Below are the final results of MADA. These can be found in the generated input_file_name.bw.mada file. Here, PRINT_ANALYSES has been set to STARS, meaning only the final, highest-scoring analysis for each word is output. Again, line breaks have been inserted below for display reasons.
;; MADA OUTPUT FILE -- VERSION 3.1 --- File created on Fri Feb 18 16:23:17 EST 2011
;; This file was produced by the command line:
;; perl /usr/local/nlp/Taggers/Arabic/MADA-3.1/MADA-selectMA.pl config=generic.madaconfig file=Test2.bw.ma REMOVE_MA_FILE=NO
;;CLASSIFIERS CONSIDERED: (asp) (cas) (enc0) (gen) (mod) (num) (per) (pos) (prc0) (prc1) (prc2) (prc3) (stt) (vox)
;;OPTIONS: (Tie Breaking is arbitrary) (Print_Analyses is stars)
;;Feature Weights: asp = 0.0374588 cas = 1.11344 enc0 = 1.14823 featsetprob = 12.1301 gen = 1.02833
mod = 1.10554 ngramdiac = 0.626436 ngramlex = 1.00998 notbackoff = 2.49999 num = 1.31008
partenc0 = 0.856252 partprc0 = 1.17501 partprc1 = 1.1553 partprc2 = 1.14064 partprc3 = 1.07678
per = 1.08064 pos = 1.02607 prc0 = 2.27208 prc1 = 1.15673 prc2 = 2.03792 prc3 = 4.50757
spellmatch = 1.81239 stt = 1.16685 vox = 0.343867
;;==========================================
;;; SENTENCE_ID SAMPLE_ID:1
;;; SENTENCE Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA AvnAr Alywm Alxmys An AsbAnyA lm twqf AlmsAEdp Alty tqdmhA llmgrb .
;;WORD Akd
;;SVM_PREDICTIONS: Akd asp:p cas:na enc0:0 gen:m mod:i num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.946921 diac:>ak~ada lex:>ak~ad_1 bw:+>ak~ad/PV+a/PVSUFF_SUBJ:3MS gloss:affirm;assure;confirm;guarantee
pos:verb prc3:0 prc2:0 prc1:0 prc0:0 per:3 asp:p vox:a mod:i gen:m num:s stt:na cas:na enc0:0 rat:na
source:spvar stem:>ak~ad stemcat:PV
--------------
;;WORD r}ys
;;SVM_PREDICTIONS: r}ys asp:na cas:n enc0:0 gen:m mod:na num:s per:na pos:noun prc0:0 prc1:0 prc2:0 prc3:0 stt:c vox:na
*1.003530 diac:ra}iysu lex:ra}iys_1 bw:+ra}iys/NOUN+u/CASE_DEF_NOM gloss:president;head;chairman pos:noun prc3:0
prc2:0 prc1:0 prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0 rat:y source:lex stem:ra}iys stemcat:N/ap
--------------
;;WORD AlHkwmp
;;SVM_PREDICTIONS: AlHkwmp asp:na cas:g enc0:0 gen:f mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*1.007589 diac:AlHukuwmapi lex:Hukuwmap_1 bw:Al/DET+Hukuwm/NOUN+ap/NSUFF_FEM_SG+i/CASE_DEF_GEN gloss:government;administration
pos:noun prc3:0 prc2:0 prc1:0 prc0:Al_det per:na asp:na vox:na mod:na gen:f num:s stt:d cas:g enc0:0 rat:y
source:lex stem:Hukuwm stemcat:Napdu
--------------
;;WORD AlAsbAnyp
;;SVM_PREDICTIONS: AlAsbAnyp asp:na cas:g enc0:0 gen:f mod:na num:s per:na pos:adj prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*0.950711 diac:Al<isobAniy~api lex:<isobAniy~_2 bw:Al/DET+<isobAniy~/ADJ+ap/NSUFF_FEM_SG+i/CASE_DEF_GEN
gloss:Spanish;Spaniard pos:adj prc3:0 prc2:0 prc1:0 prc0:Al_det per:na asp:na vox:na mod:na gen:f num:s stt:d
cas:g enc0:0 rat:y source:spvar stem:<isobAniy~ stemcat:Nall
--------------
;;WORD xwsyh
;;SVM_PREDICTIONS: xwsyh asp:na cas:u enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*1.022412 diac:xuwsiyh lex:xuwsiyh_1 bw:+xuwsiyh/NOUN_PROP+ gloss:Jose pos:noun_prop prc3:0 prc2:0 prc1:0 prc0:0 per:na
asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:lex stem:xuwsiyh stemcat:Nprop
--------------
;;WORD mAryA
;;SVM_PREDICTIONS: mAryA asp:na cas:u enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*1.022412 diac:mAriyA lex:mAriyA_1 bw:+mAriyA/NOUN_PROP+ gloss:Maria pos:noun_prop prc3:0 prc2:0 prc1:0 prc0:0 per:na
asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:lex stem:mAriyA stemcat:Nprop
--------------
;;WORD AvnAr
;;NO-ANALYSIS
;;SVM_PREDICTIONS: AvnAr asp:na cas:n enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
NO-ANALYSIS [AvnAr]
--------------
;;WORD Alywm
;;SVM_PREDICTIONS: Alywm asp:na cas:a enc0:0 gen:m mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*1.004898 diac:Alyawoma lex:yawom_1 bw:Al/DET+yawom/NOUN+a/CASE_DEF_ACC gloss:day pos:noun prc3:0 prc2:0 prc1:0 prc0:Al_det
per:na asp:na vox:na mod:na gen:m num:s stt:d cas:a enc0:0 rat:y source:lex stem:yawom stemcat:Ndu
--------------
;;WORD Alxmys
;;SVM_PREDICTIONS: Alxmys asp:na cas:a enc0:0 gen:m mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*1.004914 diac:Alxamiysa lex:xamiys_2 bw:Al/DET+xamiys/NOUN+a/CASE_DEF_ACC gloss:Thursday pos:noun prc3:0 prc2:0 prc1:0
prc0:Al_det per:na asp:na vox:na mod:na gen:m num:s stt:d cas:a enc0:0 rat:y source:lex stem:xamiys stemcat:N
--------------
;;WORD An
;;SVM_PREDICTIONS: An asp:na cas:na enc0:0 gen:na mod:na num:na per:na pos:conj_sub prc0:na prc1:0 prc2:0 prc3:0 stt:na vox:na
*0.947612 diac:>an~a lex:>an~a_1 bw:+>an~a/SUB_CONJ+ gloss:that pos:conj_sub prc3:0 prc2:0 prc1:0 prc0:na per:na
asp:na vox:na mod:na gen:na num:na stt:na cas:na enc0:0 rat:na source:spvar stem:>an~a stemcat:FW-Wa
--------------
;;WORD AsbAnyA
;;SVM_PREDICTIONS: AsbAnyA asp:na cas:u enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*0.930441 diac:>asobAniyA lex:<isobAniyA_1 bw:+>asobAniyA/NOUN_PROP+ gloss:Spain pos:noun_prop prc3:0 prc2:0 prc1:0
prc0:0 per:na asp:na vox:na mod:na gen:m num:s stt:i cas:u enc0:0 rat:y source:spvar stem:>asobAniyA stemcat:N0
--------------
;;WORD lm
;;SVM_PREDICTIONS: lm asp:na cas:na enc0:0 gen:na mod:na num:na per:na pos:part_neg prc0:na prc1:0 prc2:0 prc3:0 stt:na vox:na
*1.002028 diac:lamo lex:lamo_1 bw:+lamo/NEG_PART+ gloss:did_not pos:part_neg prc3:0 prc2:0 prc1:0 prc0:na per:na asp:na vox:na
mod:na gen:na num:na stt:na cas:na enc0:0 rat:na source:lex stem:lamo stemcat:FW-Wa
--------------
;;WORD twqf
;;SVM_PREDICTIONS: twqf asp:i cas:na enc0:0 gen:f mod:s num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.947557 diac:tuwaq~ifa lex:waq~af_1 bw:tu/IV3FS+waq~if/IV+a/IVSUFF_MOOD:S gloss:stop;detain pos:verb prc3:0 prc2:0 prc1:0
prc0:0 per:3 asp:i vox:a mod:s gen:f num:s stt:na cas:na enc0:0 rat:na source:lex stem:waq~if stemcat:IV_yu
--------------
;;WORD AlmsAEdp
;;SVM_PREDICTIONS: AlmsAEdp asp:na cas:n enc0:0 gen:f mod:na num:s per:na pos:noun prc0:Al_det prc1:0 prc2:0 prc3:0 stt:d vox:na
*0.981206 diac:AlmusAEadapu lex:musAEadap_1 bw:Al/DET+musAEad/NOUN+ap/NSUFF_FEM_SG+u/CASE_DEF_NOM
gloss:assistance;aid;support;help pos:noun prc3:0 prc2:0 prc1:0 prc0:Al_det per:na asp:na vox:na mod:na gen:f
num:s stt:d cas:n enc0:0 rat:y source:lex stem:musAEad stemcat:NapAt
--------------
;;WORD Alty
;;SVM_PREDICTIONS: Alty asp:na cas:u enc0:0 gen:f mod:na num:s per:na pos:pron_rel prc0:0 prc1:0 prc2:0 prc3:0 stt:i vox:na
*1.002288 diac:Al~atiy lex:Al~a*iy_1 bw:+Al~atiy/REL_PRON+ gloss:which;who;whom_[fem.sg.] pos:pron_rel prc3:0 prc2:0 prc1:0
prc0:0 per:na asp:na vox:na mod:na gen:f num:s stt:i cas:u enc0:0 rat:y source:lex stem:Al~atiy stemcat:FW-Wa
--------------
;;WORD tqdmhA
;;SVM_PREDICTIONS: tqdmhA asp:i cas:na enc0:3fs_dobj gen:f mod:i num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.967279 diac:tuqad~imuhA lex:qad~am_1 bw:tu/IV3FS+qad~im/IV+u/IVSUFF_MOOD:I+hA/IVSUFF_DO:3FS gloss:offer;present;introduce
pos:verb prc3:0 prc2:0 prc1:0 prc0:0 per:3 asp:i vox:a mod:i gen:f num:s stt:na cas:na enc0:3fs_dobj rat:na source:lex
stem:qad~im stemcat:IV_yu
--------------
;;WORD llmgrb
;;SVM_PREDICTIONS: llmgrb asp:na cas:g enc0:0 gen:m mod:na num:s per:na pos:noun_prop prc0:Al_det prc1:li_prep prc2:0 prc3:0 stt:d vox:na
*1.000083 diac:lilmagoribi lex:magorib_1 bw:li/PREP+Al/DET+magorib/NOUN_PROP+i/CASE_DEF_GEN
gloss:Maghreb;Maghrib_(northwest_Africa) pos:noun_prop prc3:0 prc2:0 prc1:li_prep prc0:Al_det per:na asp:na
vox:na mod:na gen:m num:s stt:d cas:g enc0:0 rat:y source:lex stem:magorib stemcat:N
--------------
;;WORD .
;;SVM_PREDICTIONS: . asp:na cas:na enc0:na gen:na mod:na num:na per:na pos:punc prc0:na prc1:na prc2:na prc3:na stt:na vox:na
*1.045299 diac:. lex:._0 bw:./PUNC gloss:. pos:punc prc3:na prc2:na prc1:na prc0:na per:na asp:na vox:na mod:na gen:na
num:na stt:na cas:na enc0:na rat:na source:punc
--------------
SENTENCE BREAK
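--------------
Following a commented file header (which lists when MADA was run, what command line was used to run it, what options were used, what classifiers were used, and what feature weights were active), the output of MADA for each input sentence consists of a ;;; SENTENCE_ID line and a ;;; SENTENCE line, and then, for each word, a ;;WORD line, a ;;SVM_PREDICTIONS line listing the classifier predictions for that word, the selected analysis prefixed by * and its score (or a NO-ANALYSIS line if the word could not be analyzed), and a separator line of dashes. A SENTENCE BREAK line closes each sentence.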
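The layout shown above is simple enough to read with a short script. The following Python sketch collects, for each ;;WORD entry, the starred analysis and its score. It assumes the on-disk format in which each analysis occupies one line, it only recognizes the * prefix produced with PRINT_ANALYSES set to STARS, and read_mada is a hypothetical name chosen for this example.

```python
def read_mada(lines):
    """Return one record per word: the word, its score, and its features."""
    records = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(";;WORD "):
            records.append(
                {"word": line[len(";;WORD "):], "score": None, "feats": None}
            )
        elif line.startswith("*") and records:
            # "*0.946921 diac:... lex:..." -> score, then feature:value pairs
            score, _, rest = line[1:].partition(" ")
            feats = {}
            for token in rest.split():
                key, sep, value = token.partition(":")
                if sep:
                    feats[key] = value
            records[-1]["score"] = float(score)
            records[-1]["feats"] = feats
        # Header comments, SVM predictions, NO-ANALYSIS lines, separators and
        # SENTENCE BREAK lines are skipped; an unanalyzed word keeps feats=None.
    return records


sample = [
    ";;WORD Akd",
    ";;SVM_PREDICTIONS: Akd asp:p cas:na pos:verb",
    "*0.946921 diac:>ak~ada lex:>ak~ad_1 pos:verb",
    "--------------",
    ";;WORD AvnAr",
    ";;NO-ANALYSIS",
    ";;SVM_PREDICTIONS: AvnAr asp:na cas:n pos:noun_prop",
    "NO-ANALYSIS [AvnAr]",
    "--------------",
    "SENTENCE BREAK",
]
records = read_mada(sample)
```

Words that MADA marks as NO-ANALYSIS come back with score and feats set to None, so callers can decide how to treat them.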
Next are shown the results of running TOKAN on the MADA output, using a TOKAN_SCHEMES_FILE that specifies two separate tokenizations, each of which is written to a different file. Here, the TOKAN_SCHEME_FILE uses as its schemes "SCHEME=ATB MARKNOANALYSIS" in the first instance and "SCHEME=D3 MARKNOANALYSIS" in the second. Since SENTENCE_IDS is set to YES, the TOKAN_SCHEME variable is automatically extended by adding "SENT_ID" to ensure that the sentence IDs are passed through to the TOKAN output. The "SCHEME=ATB" term is a shorthand that specifies that the Penn Arabic Treebank tokenization should be used (tokenize all clitics except for the definite article, normalize alefs and yaas, use + characters as clitic markers, and replace ( and ) characters with -LRB- and -RRB- respectively). "SCHEME=D3" is similar, except that it also dictates that the definite article be separated. The "MARKNOANALYSIS" term indicates that, if MADA has noted a word as NO-ANALYSIS, then that word should appear in the TOKAN output as the original word form surrounded by @@; for example: @@UNKNOWN_WORD@@. The NO-ANALYSIS word remains untokenized, as tokenization of unknown words is not possible with reliable accuracy.
The output of TOKAN is written to input_file_name.bw.mada.TOKAN_extension, where TOKAN_extension is specified in the TOKAN_SCHEMES_FILE.
SCHEME=ATB3
SAMPLE_ID:1 Akd r}ys AlHkwmp AlAsbAnyp xwsyh mAryA @@AvnAr@@ Alywm Alxmys An AsbAnyA lm twqf
AlmsAEdp Alty tqdm +hA l+ Almgrb .
SCHEME=D3
SAMPLE_ID:1 Akd r}ys Al+ Hkwmp Al+ AsbAnyp xwsyh mAryA @@AvnAr@@ Al+ ywm Al+ xmys An AsbAnyA lm twqf
Al+ msAEdp Alty tqdm +hA l+ Al+ mgrb .
As with the input, each sentence appears on one line (line breaks are inserted above for display purposes). The output above contains an example of a NO-ANALYSIS word (@@AvnAr@@) and of both pro- and enclitics that have been tokenized and marked with + characters (for example, l+ and +hA). The sample sentence contains no ( or ) characters, so the -LRB- and -RRB- replacement is not illustrated here.
MADA comes with a number of auxiliary scripts which users may find useful. These scripts are typically run on the output of MADA (the .mada file) once MADA has finished. Here we show a few examples of two of these scripts run on the MADA output described above. Note that users can display the help for each of these scripts by running the script without arguments.
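Downstream tools often need to know which TOKAN tokens belong to the same original word. The Python sketch below groups tokens using the + markers alone: a token ending in + is treated as a proclitic that attaches to what follows, and a token starting with + as an enclitic that attaches to what precedes. The function names are hypothetical, and the heuristic is an assumption that fits the sample output rather than a documented TOKAN guarantee.

```python
def group_clitics(line, sent_ids=True):
    """Group TOKAN tokens into lists, one list per original word."""
    tokens = line.split()
    if sent_ids:
        tokens = tokens[1:]  # drop the leading sentence ID
    groups = []
    prev_proclitic = False
    for tok in tokens:
        enclitic = len(tok) > 1 and tok.startswith("+")
        proclitic = len(tok) > 1 and tok.endswith("+")
        if groups and (enclitic or prev_proclitic):
            groups[-1].append(tok)
        else:
            groups.append([tok])
        prev_proclitic = proclitic
    return groups


def is_no_analysis(token):
    """True for tokens wrapped in @@ by the MARKNOANALYSIS option."""
    return len(token) > 4 and token.startswith("@@") and token.endswith("@@")


atb = group_clitics("SAMPLE_ID:1 Alty tqdm +hA l+ Almgrb .")
d3 = group_clitics("SAMPLE_ID:1 @@AvnAr@@ l+ Al+ mgrb .")
```

Joining the pieces of a group does not necessarily reproduce the original surface form (l+ Almgrb corresponds to llmgrb in the input), so the groups are best used for alignment rather than detokenization.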
The script extractFeatureIntoSentenceFormat.pl reads a MADA file and writes to STDOUT each sentence it finds there, replacing each word with its MADA-predicted value of the feature defined with the feat=<value> term. In this way, users can quickly develop a sentence-formatted list of a particular morphological feature. This is very useful, for example, in developing an N-gram language model of that feature using SRI's Language Modeling Toolkit. Here are examples of the script's use and output:
perl extractFeatureIntoSentenceFormat.pl file=Test.bw.mada feat=pos sentids
SAMPLE_ID:1 verb noun noun adj noun_prop noun_prop noun noun noun conj_sub noun_prop part_neg verb
noun pron_rel verb noun_prop punc
perl extractFeatureIntoSentenceFormat.pl file=Test.bw.mada feat=diac sentids
SAMPLE_ID:1 >ak~ada ra}iysu AlHukuwmapi Al<isobAniy~api xuwsiyh mAriyA AvnAr Alyawoma Alxamiysa
>an~a >asobAniyA lamo tuwaq~ifa AlmusAEadapu Al~atiy tuqad~imuhA lilmagoribi .
perl extractFeatureIntoSentenceFormat.pl file=Test.bw.mada feat=gen sentids includeword
SAMPLE_ID:1 Akd:m r}ys:m AlHkwmp:f AlAsbAnyp:f xwsyh:m mAryA:m AvnAr:m Alywm:m Alxmys:m An:na AsbAnyA:m
lm:na twqf:f AlmsAEdp:f Alty:f tqdmhA:f llmgrb:m .:na
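To give a feel for how the sentence-formatted output feeds an N-gram model, the Python sketch below counts feature bigrams from such lines. It is a toy illustration only and does not stand in for a language modeling toolkit; the function name, the sentids flag, and the <s> and </s> boundary symbols are assumptions made for this example.

```python
from collections import Counter


def feature_bigrams(lines, sentids=True):
    """Count bigrams over sentence-formatted feature lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if sentids:
            tokens = tokens[1:]  # the first token is the sentence ID
        if not tokens:
            continue
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


pos_line = (
    "SAMPLE_ID:1 verb noun noun adj noun_prop noun_prop noun noun noun "
    "conj_sub noun_prop part_neg verb noun pron_rel verb noun_prop punc"
)
counts = feature_bigrams([pos_line])
```

Each sentence is padded with boundary symbols so that sentence-initial and sentence-final features are counted as well.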
The script extractFeaturesIntoColumns.pl reads a MADA file and writes to STDOUT several user-specified columns for various features extracted from the MADA file. The first column is always the original word form, with the following tab-separated columns containing the features specified in the command line argument. In this way, users can quickly extract any subset of morphological features without having to parse the MADA file. Sentence breaks are marked by a blank line in the output. Also, the first line contains column headers listing the features presented. Below is the output for the diacritized form, part-of-speech and lexeme columns.
perl extractFeaturesIntoColumns.pl file=Test.bw.mada feats=diac,pos,lex
WORD	diac	pos	lex
Akd	>ak~ada	verb	>ak~ad_1
r}ys	ra}iysu	noun	ra}iys_1
AlHkwmp	AlHukuwmapi	noun	Hukuwmap_1
AlAsbAnyp	Al<isobAniy~api	adj	<isobAniy~_2
xwsyh	xuwsiyh	noun_prop	xuwsiyh_1
mAryA	mAriyA	noun_prop	mAriyA_1
AvnAr	AvnAr	noun	AvnAr
Alywm	Alyawoma	noun	yawom_1
Alxmys	Alxamiysa	noun	xamiys_2
An	>an~a	conj_sub	>an~a_1
AsbAnyA	>asobAniyA	noun_prop	<isobAniyA_1
lm	lamo	part_neg	lamo_1
twqf	tuwaq~ifa	verb	waq~af_1
AlmsAEdp	AlmusAEadapu	noun	musAEadap_1
Alty	Al~atiy	pron_rel	Al~a*iy_1
tqdmhA	tuqad~imuhA	verb	qad~am_1
llmgrb	lilmagoribi	noun_prop	magorib_1
.	.	punc	._0
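Output in this tab-separated layout can be loaded with a few lines of Python. The sketch below assumes exactly what the description states (a header line first, one word per line, a blank line between sentences); read_columns is a hypothetical helper name.

```python
def read_columns(lines):
    """Parse column output into a list of sentences of per-word dicts."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line marks a sentence break.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(header, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


sample = [
    "WORD\tdiac\tpos\tlex\n",
    "Akd\t>ak~ada\tverb\t>ak~ad_1\n",
    "r}ys\tra}iysu\tnoun\tra}iys_1\n",
    "\n",
    "lm\tlamo\tpart_neg\tlamo_1\n",
]
sentences = read_columns(sample)
```

Each word becomes a dictionary keyed by the column headers, so the same function works for any feats= selection.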