Named Entity Recognition (NER) Task zip
ANERCorp: A corpus of more than 150,000 words annotated for the NER task.
ANERGazet: A collection of 3 gazetteers: (i) Locations: a gazetteer containing names of continents, countries, cities, etc.; (ii) People: a gazetteer containing names of people collected manually from different Arabic websites; and finally (iii) Organizations: a gazetteer containing names of organizations such as companies, football teams, etc.
SVM model: An SVM model trained for Arabic Named Entity Recognition on newswire documents. The input file should be:
1- In romanized characters, using the Buckwalter mapping table;
2- With clitics segmented, i.e. tokenized text (Mona Diab's tokenizer can be used for this purpose);
3- One word per line, in a single column.
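
The last requirement can be sketched in Python as follows. This is only an illustration, not part of the distributed package: it assumes the text has already been romanized and tokenized, that tokens are separated by whitespace, and that a blank line marks the end of a sentence (the usual YamCha convention, as far as I know). The function name `to_yamcha_input` and the sample tokens are hypothetical.

```python
def to_yamcha_input(sentences):
    """Lay out tokenized sentences as one token per line.

    `sentences` is a list of strings that are assumed to be already
    romanized (Buckwalter) and clitic-segmented; this function does
    neither step, it only arranges the tokens in a single column.
    """
    lines = []
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue  # skip empty sentences
        lines.extend(tokens)
        lines.append("")  # blank line = sentence boundary (assumed convention)
    return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    # Illustrative tokens only, not taken from ANERCorp.
    text = to_yamcha_input(["qAl mHmd", "fy bgdAd"])
    with open("inputFile", "w", encoding="ascii") as handle:
        handle.write(text)
```

Writing the file as ASCII is a cheap sanity check: Buckwalter romanization uses only ASCII characters, so an encoding error here would suggest that some Arabic script was left unconverted.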

You need to have YamCha installed; then run the following command:

yamcha -m SVMmodel.model < inputFile > outputFile

The output file will contain two columns: the first for words and the second for tags.
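
A two-column file like this can be read back with a few lines of Python. The sketch below is an assumption about the layout rather than a description of the actual files: it supposes whitespace-separated columns and blank lines between sentences, and the tag names in the test data (`O`, `B-PERS`, `B-LOC`) are illustrative. The function name `read_tagged` is hypothetical.

```python
def read_tagged(text):
    """Parse two-column tagger output into sentences of (word, tag) pairs.

    The first column is taken as the word and the last column as the tag,
    so the function still works if extra columns sit in between.
    """
    sentences, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # A blank line is assumed to close the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    with open("outputFile", encoding="ascii") as handle:
        for sentence in read_tagged(handle.read()):
            # Print only the tokens that were tagged as part of an entity.
            print([(word, tag) for word, tag in sentence if tag != "O"])
```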
Test-Bed for Passage Retrieval (PR) and Question Answering (QA) tasks zip
Documents: More than 11,000 Arabic Wikipedia articles in SGML format (the format adopted in CLEF and also the one accepted by the JIRS system).
List of Questions: A list of 200 questions of different types. The proportion of each question type is the same as the one adopted in CLEF.
List of Correct Answers: For each question in my list of questions, I give here a list of correct answers. This list is very important for automatic evaluation.
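
As a rough idea of how the answer list can drive an automatic evaluation, here is a minimal Python sketch. It does not reflect the actual file formats of the test-bed or the JIRS output: the dictionaries keyed by question id, the substring matching, and the function name `answer_coverage` are all assumptions made for illustration.

```python
def answer_coverage(passages_by_question, answers_by_question):
    """Fraction of questions for which at least one retrieved passage
    contains at least one of the correct answers (plain substring match).
    """
    if not answers_by_question:
        return 0.0
    hits = 0
    for question_id, answers in answers_by_question.items():
        passages = passages_by_question.get(question_id, [])
        if any(answer in passage for passage in passages for answer in answers):
            hits += 1
    return hits / len(answers_by_question)


if __name__ == "__main__":
    # Hypothetical data: question ids, passages and answers are made up.
    passages = {"q1": ["... bgdAd ..."], "q2": ["no answer here"]}
    answers = {"q1": ["bgdAd"], "q2": ["AlqAhrp"]}
    print(answer_coverage(passages, answers))
```

Substring matching is the simplest possible criterion; a real evaluation would probably need to normalize spelling variants before comparing.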
Doc -
Arabic language rules (in Arabic): Somebody mailed me this PPS file, which summarizes all the Arabic rules. Unfortunately, there is no English version of the file. I would have translated it myself because it's really worth it, but the file contains 812 slides!
