MADA + TOKAN

This is the homepage of the MADA+TOKAN software suite, a natural language processing (NLP) system for tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization of Arabic.

Overview of the Challenges of Arabic NLP

The Arabic language raises many challenges for NLP. First, Arabic is a morphologically complex language. The morphological analysis of a specific word in a specific context consists of a large number of (partially orthogonal) features, such as basic part-of-speech (e.g., noun or verb), voice, gender, number, and information about the clitics. For Arabic, this gives us about 333,000 theoretically possible completely specified morphological analyses. In contrast, English morphological tagsets usually have about 50 tags, which cover all morphological variations.

In addition, Arabic orthographic rules cause some parts of words to be deleted or modified when cliticization occurs. For example, the Taa-Marbuta (ة) appears as a regular Taa (ت) when followed by a pronominal clitic. Simply segmenting off the pronominal clitic without recovering the Taa-Marbuta could cause unnecessary ambiguity or add to the sparsity problem. Another example of partial word deletion occurs with Arabic's optional diacritics; these characters primarily specify short vowels, but are usually absent in modern text sources, which contributes to ambiguity.
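The Taa-Marbuta case can be illustrated with a minimal Python sketch. The enclitic list below is a tiny hypothetical sample, and the `stem_ends_in_taa_marbuta` flag stands in for knowledge that only a morphological analysis can supply; neither is taken from MADA or TOKAN.

```python
TAA = "\u062A"          # ت (regular Taa)
TAA_MARBUTA = "\u0629"  # ة (Taa-Marbuta)

# Hypothetical, deliberately tiny sample of pronominal enclitics: ها, هم, ه, ك
PRONOMINAL_ENCLITICS = ["\u0647\u0627", "\u0647\u0645", "\u0647", "\u0643"]


def split_pronominal_enclitic(word, stem_ends_in_taa_marbuta):
    """Split a trailing pronominal enclitic and optionally restore Taa-Marbuta.

    The surface form alone cannot tell whether a stem-final Taa was
    originally a Taa-Marbuta, so that decision is passed in as a flag
    (an assumption standing in for a real morphological analysis).
    """
    # Try longer enclitics first so that "ها" wins over "ه"-style overlaps.
    for clitic in sorted(PRONOMINAL_ENCLITICS, key=len, reverse=True):
        if word.endswith(clitic) and len(word) > len(clitic):
            stem = word[: -len(clitic)]
            if stem_ends_in_taa_marbuta and stem.endswith(TAA):
                stem = stem[:-1] + TAA_MARBUTA
            return stem, clitic
    return word, ""


# مدرستها ("her school") = مدرسة + ها
word = "\u0645\u062F\u0631\u0633\u062A\u0647\u0627"
naive = split_pronominal_enclitic(word, False)     # stem keeps the regular Taa
restored = split_pronominal_enclitic(word, True)   # stem regains the Taa-Marbuta
```

Without the restoration step, the stem مدرست is a string that never occurs as a standalone word, which is the sparsity problem described above.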

Another issue is that the Arabic writing system also shows different levels of specificity in spelling some letters, e.g., an Alef (أ) can be spelled without the Hamza (ء) as (ا). Similarly, a Ya (ي) can be spelled without the dots as an Alef Maksura (ى). The complexity of the morphology, together with the underspecification of the orthography, creates a high degree of ambiguity. On average, a word form in the Penn Arabic Treebank (PATB) has about 12 potential morphological analyses. For example, the word والى can be analyzed as any of the following:

Each of these cases has a different diacritization (that is, a specific arrangement of the word with inserted diacritic markers that help to disambiguate the intended analysis).
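As a rough illustration of how missing diacritics and underspecified letters merge distinct words into one surface form, here is a minimal Python sketch. The normalization table follows one common convention (bare Alef for the Hamza-bearing Alefs, dotted Ya for Alef Maksura); it is an assumption for illustration and not necessarily what MADA does internally.

```python
import re

# Arabic diacritics from fathatan through sukun (U+064B to U+0652),
# which includes the short vowels and the shadda.
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumed normalization table mirroring the spelling variation described above.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # أ -> ا
    "\u0625": "\u0627",  # إ -> ا
    "\u0622": "\u0627",  # آ -> ا
    "\u0649": "\u064A",  # ى -> ي
})


def strip_diacritics(text):
    """Remove optional diacritic marks, leaving only the base letters."""
    return DIACRITICS.sub("", text)


def normalize(text):
    """Strip diacritics, then collapse underspecified letter variants."""
    return strip_diacritics(text).translate(NORMALIZE)


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # كَتَبَ "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # كُتُب "books"

# Both diacritized words reduce to the same undiacritized string كتب.
same_surface = strip_diacritics(kataba) == strip_diacritics(kutub)
```

Disambiguation runs in the opposite direction: given only the shared surface form and its context, the system has to decide which diacritized reading was intended.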

The MADA Approach

Much work has been done on addressing different specific natural language processing tasks for Arabic, such as:

The MADA system, along with TOKAN, provides one solution to all of these different problems. Our approach distinguishes between the problems of morphological analysis (what are the different readings of a word out of context) and morphological disambiguation (what is the correct reading in a specific context). Once a morphological analysis is chosen in context, we can determine its full POS tag, lemma and diacritization. Morphological analysis and disambiguation are handled in the MADA component of our system. Knowing the morphological analysis also allows us to tokenize and stem deterministically. Since there are many different ways to tokenize Arabic (tokenization is a convention adopted by researchers), the TOKAN component allows the user to specify any tokenization scheme that can be generated from disambiguated analyses. The tokenized version is produced using the Almorgeana (ALMOR) generator.

MADA+TOKAN

MADA (Morphological Analysis and Disambiguation for Arabic) makes use of 24 features to select, for each word, a proper analysis from a list of potential analyses provided by ALMOR. Each ALMOR analysis is scored: analyses that closely match the feature predictions receive higher scores than those that do not. The analysis score is a normalized sum of feature weights; the weights are tuned so that more predictive features are given stronger weight.
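The scoring idea can be sketched in a few lines of Python. This is a simplified illustration, not MADA's implementation: the feature names, the weights, and the choice to normalize by the total weight are all assumptions made for the example.

```python
def score_analysis(analysis, predictions, weights):
    """Return the weight of matching features divided by the total weight.

    analysis:    feature -> value for one candidate analysis
    predictions: feature -> value predicted by the per-feature classifiers
    weights:     feature -> tuned weight (illustrative values here)
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight
        for feature, weight in weights.items()
        if feature in predictions and analysis.get(feature) == predictions[feature]
    )
    return matched / total


def rank_analyses(analyses, predictions, weights):
    """Sort candidate analyses from best to worst score."""
    return sorted(
        analyses,
        key=lambda a: score_analysis(a, predictions, weights),
        reverse=True,
    )


# Hypothetical classifier output and weights for a single word.
predictions = {"pos": "noun", "gen": "f", "num": "s"}
weights = {"pos": 3.0, "gen": 1.0, "num": 1.0}
candidates = [
    {"id": "a", "pos": "verb", "gen": "f", "num": "s"},  # matches gen, num
    {"id": "b", "pos": "noun", "gen": "f", "num": "p"},  # matches pos, gen
]
ranked = rank_analyses(candidates, predictions, weights)
```

Candidate "b" ranks first because it agrees with the prediction on the heavily weighted part-of-speech feature, even though both candidates match two features each.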

The 24 features consist of 14 morphological features (aspect, case, enclitic, gender, mood, number, person, part-of-speech, 4 levels of proclitics, state, and voice), which MADA predicts using 14 distinct Support Vector Machines (SVMs) trained on the PATB. There are also five additional features, which are coarser-grained matches of the enclitic and proclitic features. The final five features capture information such as spelling variations and n-gram statistics.

Since MADA selects a complete analysis from ALMOR, all decisions regarding morphological ambiguity, lexical ambiguity, tokenization, diacritization and POS tagging in any possible POS tagset are made in one fell swoop. The choices not selected are ranked by their scores. MADA achieves 95% accuracy on basic morphological choice (including tokenization but excluding case/mood/nunation), 97% accuracy on part-of-speech selection, and 96% accuracy on lemmatization. MADA achieves 87% accuracy in predicting full diacritization (including case and mood). MADA is a useful resource not only for NLP applications but also for language learning, as its output can be used as a study/reading aid that provides contextual disambiguation.

TOKAN, a general tokenizer for Arabic, provides an easy-to-use resource for creating highly customized tokenizations of MADA-disambiguated Arabic text. For instance, the decision on whether an Arabic word has a conjunction or preposition clitic is made in MADA, but how such clitics are separated and displayed (accounting for various morphotactics and normalizations) before using them in an application is done with TOKAN. The different types of tokenizations can be used as machine learning features for a variety of applications, such as machine translation or named-entity recognition.
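The division of labor can be sketched as follows: the analysis (which clitics a word has) is fixed, and a scheme only decides which clitics to detach. The slot names, the `+` markers, and the representation of a scheme as a set are hypothetical choices for this Python illustration, not TOKAN's actual scheme syntax.

```python
PROCLITIC_SLOTS = ("conj", "prep")  # ordered outermost first
ENCLITIC_SLOT = "pron"


def tokenize(analysis, split):
    """Render one disambiguated word under a tokenization scheme.

    analysis: dict with a "stem" and optional clitic slots
    split:    set of slot names to detach; all other clitics stay attached
    """
    tokens = []
    current = ""
    for slot in PROCLITIC_SLOTS:
        form = analysis.get(slot, "")
        if not form:
            continue
        current += form
        if slot in split:
            # Detach everything accumulated so far as a marked proclitic token.
            tokens.append(current + "+")
            current = ""
    current += analysis["stem"]
    enclitic = analysis.get(ENCLITIC_SLOT, "")
    if enclitic and ENCLITIC_SLOT in split:
        tokens.append(current)
        tokens.append("+" + enclitic)
    else:
        tokens.append(current + enclitic)
    return tokens


# وبكتابهم "and with their book", written in Buckwalter-style transliteration.
analysis = {"conj": "w", "prep": "b", "stem": "ktAb", "pron": "hm"}

unsplit = tokenize(analysis, set())
conj_only = tokenize(analysis, {"conj"})
all_clitics = tokenize(analysis, {"conj", "prep", "pron"})
```

Because every scheme is generated from the same disambiguated analysis, switching schemes changes only the rendering, never the underlying morphological decision.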

MADA and TOKAN are packaged and continuously updated. The system is freely available for research purposes. MADA+TOKAN has been used by numerous academic and commercial research institutes around the world, e.g., University of Washington, Cambridge University, SRI, BBN, Fair Isaac, MIT, RWTH Aachen, Polytechnic University of Catalunya (UPC), Copenhagen Business School, National Research Council of Canada, etc. The system has been shown to improve performance in a variety of NLP applications.