For reference, this page defines terms related to MADA+TOKAN, Arabic, and Natural Language Processing in general. It is meant to help users of MADA who are new to the nuances of Arabic NLP.
affix - A morpheme that attaches to the stem of a word, either at the beginning (prefix), at the end (suffix) or surrounding the stem (circumfix). Unlike clitics, affixes are syntactically and phonologically part of the word they attach to.
ATB (or PATB) - Abbreviation for Penn Arabic Treebank. Also a name for a common tokenization scheme which segments all clitics except the Al+ determiner.
Buckwalter encoding - a strict, one-to-one transliteration of Modern Standard Arabic script characters to ASCII invented by lexicographer Tim Buckwalter. A chart showing the Buckwalter character mapping.
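To make the one-to-one property concrete, here is a minimal Python sketch of the transliteration. The table is deliberately partial (a handful of letters and diacritics from the full chart), and the function names to_buckwalter and from_buckwalter are hypothetical names chosen for this example; they are not part of MADA.

```python
# Partial Buckwalter table: Arabic code point -> ASCII symbol.
# Only a subset of the full chart is listed; unknown characters pass through.
BUCKWALTER = {
    "\u0621": "'",  # hamza
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u0629": "p",  # ta marbuta
    "\u062A": "t",  # ta
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # mim
    "\u0646": "n",  # nun
    "\u0647": "h",  # ha
    "\u0648": "w",  # waw
    "\u064A": "y",  # ya
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukun
}

# Because the mapping is one-to-one, it can be inverted without loss.
REVERSE = {ascii_sym: arabic for arabic, ascii_sym in BUCKWALTER.items()}


def to_buckwalter(text):
    """Transliterate Arabic script to Buckwalter ASCII, character by character."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Transliterate Buckwalter ASCII back to Arabic script."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


kitab = "\u0643\u062A\u0627\u0628"  # kaf, ta, alif, ba
print(to_buckwalter(kitab))
```

Since each Arabic character corresponds to exactly one ASCII symbol, converting to Buckwalter and back returns the original string for any text covered by the table.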
clitic - A morpheme that has syntactic characteristics of a word but is phonologically bound to another word (usually directly, like an affix, but sometimes unattached). Proclitics attach to the beginning of the word; enclitics attach to the end.
diacritics - In Arabic, symbols used to represent short vowels (vowel diacritics), indefiniteness (nunation diacritics), or consonant doubling (shadda diacritics). Their inclusion in written Arabic is optional; a written word may carry all, some or none of the diacritics it has when spoken. Diacritization is the process of recovering or generating the missing diacritics for undiacritized text. In MADA, the diacritized form (or diac, for short) is the form of the word with all of its diacritics in place.
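The reverse of diacritization, stripping diacritics from a word, is simple enough to sketch in a few lines of Python. This is an illustration only, not MADA's code: the function name undiacritize is hypothetical, and the character range is an assumption that covers the nunation, short vowel and shadda marks described above plus the sukun (the "no vowel" mark), which the definition does not mention.

```python
import re

# Unicode range assumed for this sketch:
#   U+064B-U+064D  nunation diacritics
#   U+064E-U+0650  short vowel diacritics
#   U+0651         shadda
#   U+0652         sukun (not mentioned in the definition above)
DIACRITICS = re.compile("[\u064B-\u0652]")


def undiacritize(word):
    """Return the word with all of its diacritics removed."""
    return DIACRITICS.sub("", word)


diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
bare = undiacritize(diacritized)                      # kaf, ta, ba
print(len(diacritized), len(bare))
```

Stripping is deterministic, while diacritization is not: several diacritized forms can collapse to the same bare form, which is why recovering them requires context.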
form-based morphology - the study of the form of units making up words, their interactions with each other and how they relate to the word's overall form.
functional morphology - the study of the function of units inside words and how they affect syntax and semantics.
gloss - A direct, simple and context-free translation of the word as might be found in a dictionary.
lemma - a conventional choice of one word to represent a particular lexeme (set); i.e., one word used to represent all words that have the same core meaning and differ only in inflection and clitics. Also referred to as the citation form. Lemmatization is the process of mapping a given word to its corresponding lemma.
lexeme - a lexicographic abstraction indicating the set of all words that share a core meaning and differ only in inflection and cliticization.
morpheme - the smallest meaningful unit in a language.
morphological analysis - as a process, determining all the possible morphological analyses of a word. In MADA, a single morphological analysis consists of the set of morphological feature values (the part-of-speech, gender, case, voice, mood, etc.), along with the gloss, lemma and diacritized form of the word. The full list of the features MADA uses and their possible values can be found in the MADA User Manual.
morphology - the study of internal word structure. Arabic is often described as having a rich morphology since its internal word structures tend to be more complex and more prone to ambiguities (resolvable by context) than those found in other languages such as English.
N-gram language models - A type of probabilistic model used in NLP to predict a word or word feature based on the words or word features that precede and/or follow it (that is, the word context). MADA uses N-gram models (built using the popular SRI Language Modeling Toolkit) of the lemmas and diacritic word forms to make predictions for those features.
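As a rough illustration of the idea, the Python sketch below builds a bigram (N = 2) model from a toy corpus and uses it to choose between candidate forms given the preceding word. It uses plain maximum-likelihood counts with no smoothing, which is a simplification; it does not reproduce the SRI toolkit or MADA's actual models, and all function names and the three-sentence corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]  # sentence boundary markers
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0


def best_candidate(counts, prev, candidates):
    """Pick the candidate the model considers most likely after prev."""
    return max(candidates, key=lambda c: bigram_prob(counts, prev, c))


# Toy corpus in Buckwalter transliteration (hypothetical data).
corpus = [["fy", "Albyt"], ["fy", "Albyt"], ["fy", "Almdrsp"]]
counts = train_bigrams(corpus)

print(bigram_prob(counts, "fy", "Albyt"))
print(best_candidate(counts, "fy", ["Almdrsp", "Albyt"]))
```

The same counting scheme applies whether the tokens are surface words, lemmas or diacritized forms; only the training data changes.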
stem - the core part of a word to which affixes and clitics attach.
support vector machines (SVMs) - a class of supervised learning algorithms that can be used to divide data into classes (classification) or for other tasks. SVMs were first developed by Vladimir Vapnik. A number of freely available toolkits exist for implementing SVMs; MADA currently uses SVMTools to create predictions of the basic morphological features (part-of-speech, gender, person, case, mood, etc.).
tokenization - the division of a word into clusters of consecutive morphemes (one of which is typically the word stem, and the rest of which are typically inflectional morphemes). The specific division choices are defined by the tokenization scheme. Tokenization schemes typically involve segmentation (the breaking up of the word with the insertion of whitespace), and some amount of orthography regularization. This regularization corrects spelling that becomes erroneous after morphemes are separated.
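The interplay of segmentation and orthography regularization can be sketched in Python as follows. This is a toy under stated assumptions, not TOKAN: the segmentation is supplied by hand (in practice it would come from a morphological analysis), the dictionary keys and the function name tokenize are hypothetical, the "+" markers on clitics are a convention chosen for this sketch, and only one regularization rule is shown. The rule is that a ta marbuta (Buckwalter p) is written as t when an enclitic follows, so it is restored to p once the enclitic is split off.

```python
def tokenize(analysis):
    """Turn a pre-segmented word into space-separated tokens.

    analysis is a dict with these hypothetical keys:
      proclitics  - list of proclitics, outermost first
      base        - the word without its clitics, spelled as in the full word
      enclitics   - list of enclitics
      ta_marbuta  - True if the base ends in ta marbuta when written alone
    """
    base = analysis["base"]

    # Orthography regularization: with the enclitic split off, a final "t"
    # that stands for ta marbuta is spelled "p" again.
    if analysis["enclitics"] and analysis["ta_marbuta"] and base.endswith("t"):
        base = base[:-1] + "p"

    # Segmentation: mark each clitic with "+" on the side where it attached.
    tokens = [p + "+" for p in analysis["proclitics"]]
    tokens.append(base)
    tokens += ["+" + e for e in analysis["enclitics"]]
    return " ".join(tokens)


# wbmdrsthm: conjunction w+, preposition b+, base mdrsp, enclitic +hm
word = {
    "proclitics": ["w", "b"],
    "base": "mdrst",
    "enclitics": ["hm"],
    "ta_marbuta": True,
}
print(tokenize(word))
```

Without the regularization step the base token would come out as mdrst, a spelling that is only correct while the enclitic is still attached.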
vocable - a morphological characterization of a set of word forms without considering semantic distinctions. For example, in English, bank (a financial institution) and bank (a river boundary) represent separate lexeme sets, but are in the same vocable set.