MADA+TOKAN:
A System for Arabic Tokenization, Diacritization, Morphological
Disambiguation, POS Tagging, Stemming and Lemmatization
1. Arabic Processing Challenges
4. MADA+TOKAN Manual (with examples)
6. Publications (About and Using MADA+TOKAN)
7. Contacts: Nizar Habash and Owen Rambow (<lastname> AT ccls DOT columbia DOT edu)
The Arabic language raises many challenges for natural
language processing (NLP). First, Arabic is a morphologically complex language.
The morphological analysis of a word consists of determining the values of a
large number of (partially orthogonal) features, such as basic part-of-speech
(e.g., noun or verb), voice, gender, number, information about the
clitics, and so on. For Arabic,
this gives us about 333,000 theoretically possible completely specified
morphological analyses. In contrast, English morphological tagsets usually have
about 50 tags, which cover all morphological variations. Second, Arabic
orthographic rules cause some parts of words to be deleted or modified when
cliticization occurs. For example, the Taa-Marbuta appears as a regular Taa
when followed by a pronominal clitic.
Simple segmentation of the pronominal clitic without recovering the
Taa-Marbuta could cause unnecessary ambiguity or add to the sparsity problem.
Third, Arabic is written with optional diacritics that primarily specify short
vowels; they are usually absent, which contributes to ambiguity. Finally, the writing system also shows
different levels of specificity in spelling some letters, e.g. أ can be
spelled without the Hamza (ء) as ا and
ي can be spelled without the dots as ى. The complexity of the morphology together
with the underspecification of the orthography creates a high degree of
ambiguity. On average, a word form in the Penn Arabic Treebank (PATB; Maamouri
et al., 2004) has about 12 morphological analyses. For example, the word والى can be
analyzed as والي `ruler',
و+الى+ي `and to me', و+ألي
`and I follow',
و+آل+ي `and my clan' or و+آلي `and automatic'. Each of these cases has a different
diacritization.
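The effect of underspecified spelling can be illustrated with a minimal Python sketch. The normalization table and the function names below are assumptions made for illustration and are not part of MADA+TOKAN; the sketch only shows how several distinct readings collapse into one surface string once Hamza, dots and diacritics are dropped.

```python
# Hypothetical normalization table (not part of MADA+TOKAN): collapse
# spelling variants that writers often use interchangeably.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # Alef with Hamza above -> bare Alef
    "\u0625": "\u0627",  # Alef with Hamza below -> bare Alef
    "\u0622": "\u0627",  # Alef with Madda       -> bare Alef
    "\u064A": "\u0649",  # dotted Ya             -> dotless Ya
})


def strip_diacritics(word: str) -> str:
    """Remove the optional marks U+064B..U+0652 (tanween, short vowels, shadda, sukun)."""
    return "".join(ch for ch in word if not "\u064B" <= ch <= "\u0652")


def underspecify(word: str) -> str:
    """Return the least specific spelling of a word."""
    return strip_diacritics(word).translate(NORMALIZE)


# Three of the readings of the example word, written with escapes.
# In Buckwalter transliteration they are wAly, w>ly and w|ly.
readings = [
    "\u0648\u0627\u0644\u064A",  # 'ruler'
    "\u0648\u0623\u0644\u064A",  # 'and I follow'
    "\u0648\u0622\u0644\u064A",  # 'and automatic'
]
surface_forms = {underspecify(w) for w in readings}
print(len(surface_forms))  # prints 1
```

All three readings map to the same underspecified string, which is why the correct reading has to be recovered from context.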
Much work has been done on addressing
different specific natural language processing tasks for Arabic, such
as tokenization, diacritization, morphological disambiguation,
part-of-speech (POS) tagging, stemming and lemmatization. (The papers cited below
contain a discussion of relevant work.) The MADA system, along with TOKAN, provides one
solution to all of these different problems. Our approach distinguishes between the problems of
morphological analysis (what are the different readings of a word out of
context) and morphological disambiguation (what is
the correct reading in a specific context). Once a morphological analysis
is chosen in context, we can determine its full POS tag, lemma and
diacritization. Morphological analysis and disambiguation are handled
in the MADA component of our system. Knowing the morphological analysis also allows us
to tokenize and stem deterministically. Since there are many different
ways to tokenize Arabic (tokenization is a convention adopted by
researchers), the TOKAN component allows the user to specify any
tokenization scheme that can be generated from disambiguated
analyses. The tokenized version is produced using the ARAGEN generator
(Habash 2004).
MADA (Morphological Analysis and Disambiguation for
Arabic) uses 19 orthogonal features to select, for each word, the best analysis
from a list of potential analyses provided by the Buckwalter Arabic
Morphological Analyzer (BAMA; Buckwalter 2004). The BAMA analysis
that matches the most predicted features wins. The 19 features
comprise 14 morphological features (e.g., number, gender, case and mood),
which MADA predicts using 14 distinct Support Vector Machines trained on
the PATB, and five further features that capture spelling variations and
n-gram statistics, among others. Since MADA selects a
complete analysis from BAMA, all decisions regarding morphological ambiguity,
lexical ambiguity, tokenization, diacritization and POS tagging in any possible
POS tagset are made in one fell swoop (Habash and Rambow, 2005; Habash and
Rambow, 2007; Roth et al., 2008).
The choices not selected are ranked in terms of their score. MADA has over 96% accuracy on basic
morphological choice (including tokenization but excluding case/mood/nunation)
and on lemmatization. MADA has over 86% accuracy in predicting full
diacritization (including case and mood).
MADA is a useful resource not only for NLP applications but also for
language learning as its output can be used as a study/reading aid that
provides contextual disambiguation. Detailed comparative evaluations are
provided in the following publications:
(Habash and Rambow, 2005; Habash and Rambow, 2007; Roth et al.,
2008).
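The selection step described above (the candidate analysis that agrees with the most predicted feature values wins, and the rest are ranked by score) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not MADA's implementation: the feature names (`pos`, `conj`, `pron`), the hand-written predictions standing in for classifier output, and the unweighted match count are all hypothetical.

```python
from typing import Dict, List, Tuple

Analysis = Dict[str, str]


def rank_analyses(predicted: Analysis,
                  candidates: List[Analysis]) -> List[Tuple[int, Analysis]]:
    """Score each candidate by the number of predicted feature values it
    matches, and return (score, candidate) pairs, best first."""
    scored = [
        (sum(1 for feat, val in predicted.items() if cand.get(feat) == val), cand)
        for cand in candidates
    ]
    # sorted() is stable, so tied candidates keep the analyzer's original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical out-of-context analyses of one ambiguous word.
candidates = [
    {"gloss": "ruler",         "pos": "noun", "conj": "no",  "pron": "no"},
    {"gloss": "and to me",     "pos": "prep", "conj": "yes", "pron": "yes"},
    {"gloss": "and I follow",  "pos": "verb", "conj": "yes", "pron": "no"},
    {"gloss": "and my clan",   "pos": "noun", "conj": "yes", "pron": "yes"},
    {"gloss": "and automatic", "pos": "adj",  "conj": "yes", "pron": "no"},
]

# Hand-written values standing in for per-feature classifier predictions.
predicted = {"pos": "noun", "conj": "no", "pron": "no"}

ranked = rank_analyses(predicted, candidates)
best_score, best = ranked[0]
print(best["gloss"], best_score)  # prints: ruler 3
```

Because a whole analysis is selected rather than individual feature values, the chosen candidate fixes the tokenization, POS tag, lemma and diacritization together.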
TOKAN, a general tokenizer for Arabic, provides an
easy-to-use resource for tokenizing MADA-disambiguated Arabic text according to
a wide range of tokenization schemes (Habash 2007). For instance, the decision on whether an
Arabic word has a conjunction or preposition clitic is made in MADA, but how
such clitics are separated (accounting for various morphotactics and
normalizations) before using them in an application is decided in TOKAN. The
different types of tokenizations can be used as machine learning features for a
variety of applications, such as machine translation or named-entity
recognition.
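The division of labor between disambiguation and tokenization can be illustrated with a short Python sketch. It is a toy under stated assumptions, not TOKAN's actual scheme syntax or data format: the dictionary layout, the clitic class names and the `tokenize` function are hypothetical. Words are given in Buckwalter transliteration, where `p` stands for the Taa-Marbuta.

```python
from typing import Dict, List, Set


def tokenize(analysis: Dict, split: Set[str]) -> List[str]:
    """Render one disambiguated analysis under a tokenization scheme.

    `split` names the clitic classes to detach; everything else stays
    attached to the stem.
    """
    tokens: List[str] = []
    attached = ""
    for cls, form in analysis["proclitics"]:  # outermost proclitic first
        if cls in split:
            tokens.append(attached + form + "+")
            attached = ""
        else:
            attached += form

    stem = analysis["stem"]
    trailing: List[str] = []
    enclitic = analysis.get("enclitic")
    if enclitic is not None:
        cls, form = enclitic
        if cls in split:
            # Detached pronoun: the stem keeps its underlying Taa-Marbuta.
            trailing.append("+" + form)
        else:
            # Attached pronoun: Taa-Marbuta surfaces as a regular Taa.
            if stem.endswith("p"):
                stem = stem[:-1] + "t"
            stem += form
    tokens.append(attached + stem)
    return tokens + trailing


# 'and in his school': w+ b+ mdrsp +h, surface form wbmdrsth.
word = {
    "proclitics": [("conj", "w"), ("prep", "b")],
    "stem": "mdrsp",
    "enclitic": ("pron", "h"),
}

print(tokenize(word, set()))                     # ['wbmdrsth']
print(tokenize(word, {"conj"}))                  # ['w+', 'bmdrsth']
print(tokenize(word, {"conj", "prep", "pron"}))  # ['w+', 'b+', 'mdrsp', '+h']
```

The same analysis yields all three outputs; only the scheme changes. The fully split output also shows the Taa-Marbuta being recovered on the stem rather than left as a regular Taa.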
MADA and TOKAN are packaged together and continuously updated. The system is freely
available for research purposes. MADA+TOKAN has been used by numerous academic
and commercial research institutes around the world, including the University of
Washington, Cambridge University, SRI, BBN, Fair Isaac, MIT, RWTH
Aachen, Polytechnic University of Catalunya (UPC), Copenhagen Business
School and National Research Center of Canada, among others. The system has been shown to
improve performance in a variety of NLP
applications.
About MADA+TOKAN
Roth, Ryan,
Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. Arabic Morphological
Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature
Ranking. In Proceedings of Association for Computational
Linguistics (ACL), Columbus, Ohio. 2008. [PDF]
Habash, Nizar
and Owen Rambow. Arabic Diacritization through Full
Morphological Tagging. In Human Language Technologies 2007: The
Conference of the North American Chapter of the Association for Computational
Linguistics (NAACL HLT 2007); Companion Volume, Short
Papers. 2007. [PDF]
Habash, Nizar
and Owen Rambow. Arabic Tokenization, Morphological Analysis, and
Part-of-Speech Tagging in One Fell Swoop. In Proceedings
of the Conference of the Association for Computational Linguistics
(ACL'05). 2005. [PDF]
Habash, Nizar.
"Arabic Morphological Representations for Machine Translation." Book
Chapter. In Arabic Computational Morphology: Knowledge-based and Empirical
Methods. Editors Antal van den Bosch and Abdelhadi Soudi. Kluwer/Springer Publications, 2007.
Using MADA+TOKAN
Badr,
Ibrahim, Rabih Zbib and
James Glass. Segmentation for English-to-Arabic
Statistical Machine Translation. In Proceedings of the Conference of Association
for Computational Linguistics (ACL), Columbus, Ohio. 2008. [PDF]
Benajiba,
Yassine, Mona Diab and Paolo Rosso.
Arabic Named Entity Recognition using Optimized Feature Sets. In Proceedings of the 2008 Conference on Empirical Methods in
Natural Language Processing, Honolulu, October 2008. [PDF]
Costa-jussa,
Marta R., Josep M. Crego, Adria de Gispert, Patrik
Lambert, Maxim Khalilov, Jose A.R.
Fonollosa, Jose B. Marino and Rafael Banchs. TALP Phrase-Based System and TALP
System Combination for IWSLT 2006. In Proceedings of the
International Workshop on Spoken Language Translation, Kyoto, Japan, 2006.
[PDF]
Crego, Josep M., Adria de
Gispert, Patrik Lambert, Maxim Khalilov, Marta R. Costa-jussa, Jose B. Marino,
Rafael Banchs and Jose A.R. Fonollosa. The TALP Ngram-based SMT System for
IWSLT 2006. In Proceedings of the International Workshop on
Spoken Language Translation, Kyoto, Japan, 2006. [PDF]
Crego, Josep
M. and Nizar Habash. Using Shallow Syntax Information to Improve Word
Alignment and Reordering for SMT. In Proceedings of the
Statistical Machine Translation Workshop at the Conference of Association for
Computational Linguistics (ACL), Columbus, Ohio. 2008. [PDF]
Diab, Mona,
Mahmoud Ghoneim and Nizar Habash. Arabic Diacritization in the Context of
Statistical Machine Translation. In Proceedings of the Machine Translation
Summit (MT-Summit), Copenhagen, Denmark, 2007. [PDF]
Elming, Jakob
and Nizar Habash. Combination of Statistical Word Alignments Based on
Multiple Preprocessing Schemes. In Proceedings of the North American
Chapter of the Association for Computational Linguistics (NAACL), Rochester,
New York, 2007. [PDF]
Elming,
Jakob, Nizar Habash and Josep Crego. Combination of
Statistical Word Alignments Based on Multiple Preprocessing Schemes.
Book chapter in Learning for Machine Translation.
Editors Cyril Goutte, Nicola Cancedda, Marc Dymetman, and George Foster. MIT
Press, 2008.
Farber,
Benjamin, Dayne Freitag, Nizar Habash and Owen Rambow. Improving NER in
Arabic Using a Morphological Tagger. In Proceedings of
the Language Resources and Evaluation Conference (LREC), Marrakech, Morocco,
2008.
Habash,
Nizar. Syntactic Preprocessing for Statistical Machine Translation. In Proceedings
of the Machine Translation Summit (MT-Summit), Copenhagen, Denmark, 2007. [PDF]
Habash, Nizar
and Fatiha Sadat. Arabic Preprocessing Schemes for Statistical Machine
Translation. In Proceedings of the North American Chapter of the
Association for Computational Linguistics (NAACL), New York, 2006. [PDF]
Sadat, Fatiha
and Nizar Habash. Combination of Preprocessing Schemes for
Statistical MT. In Proceedings of COLING-ACL,
Sydney, Australia, 2006. [PDF]
Snider, Neal
and Mona Diab. Unsupervised Induction of Modern
Standard Arabic Verb Classes Using Syntactic Frames and LSA. In
Proceedings of COLING-ACL, Sydney, Australia, 2006. [PDF]
Vilar, David, Daniel Stein, Yuqi Zhang, Evgeny Matusov, Arne Mauser,
Oliver Bender, Saab Mansour and Hermann Ney. The RWTH Machine Translation System
for IWSLT 2008. In Proceedings of the International
Workshop on Spoken Language Translation, Waikiki, Hawai'i, 2008. [PDF]
If you have used MADA+TOKAN and cited our papers, please let us know so that
we can add your publication here.
References
Buckwalter, Tim. 2004.
Buckwalter Arabic Morphological Analyzer Version 2.0.
Linguistic Data Consortium (LDC) catalog number LDC2004L02, ISBN
1-58563-324-0.
Habash, Nizar. 2004.
Large Scale Lexeme Based Arabic Morphological Generation.
In Proceedings of Traitement Automatique du Langage Naturel
(TALN-04). Fez, Morocco. [PDF]
Habash, Nizar. 2007. "Arabic Morphological Representations for
Machine Translation." Book Chapter. In Arabic Computational Morphology:
Knowledge-based and Empirical Methods. Editors Antal van den Bosch and
Abdelhadi Soudi. Kluwer/Springer Publications.
Habash,
Nizar and Owen Rambow. 2005. Arabic
Tokenization, Morphological Analysis, and Part-of-Speech Tagging
in One Fell Swoop. In Proceedings of ACL'05.
Habash, Nizar and Owen Rambow. 2007. Arabic Diacritization through Full
Morphological Tagging. In Proceedings of NAACL-HLT'07.
Maamouri, Mohamed, Ann
Bies, Tim Buckwalter and Wigdan Mekki.
2004. The Penn Arabic
Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language
Resources and Tools.
Roth, Ryan, Owen Rambow, Nizar Habash,
Mona Diab and Cynthia Rudin.
2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using
Lexeme Models and Feature Ranking. In Proceedings of
ACL'08.