ANT Corpus : An Arabic News Text Collection for Textual Classification

Title: ANT Corpus : An Arabic News Text Collection for Textual Classification
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Chouigui, A, Ben Khiroun, O, Elayeb, B
Conference Name: Proceedings of the 14th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2017), IEEE Computer Society, Hammamet, Tunisia, October 30th to November 3rd 2017, pp. 135-142.
Keywords: Arabic language, NB, RSS crawling, standard Arabic corpus, SVM, text classification, TREC format

In this paper, we propose a new online Arabic corpus of news articles, named ANT Corpus, which is collected from RSS feeds. Each document represents an article structured in the standard XML TREC format. We use the ANT Corpus for Text Classification (TC) by applying the SVM and Naive Bayes (NB) classifiers to assign each article to its appropriate predefined category. We also study the contribution of term weighting, stop-word removal and light stemming to Arabic TC. The experimental results show that text length considerably affects TC accuracy and that title words alone are not significant enough to achieve good classification rates. In conclusion, the SVM method gives the best classification results on both the title and text parts.
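The pipeline described in the abstract (TREC-structured articles, stop-word removal, then a classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. The tag names (DOCNO, CATEGORY, TITLE, TEXT), the four sample documents, the tiny stop-word list and every function name below are assumptions made for illustration. Only a multinomial Naive Bayes classifier with add-one smoothing is shown; the SVM classifier, term weighting and light stemming are left out.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical TREC-style documents; tag names and contents are assumptions.
SAMPLE = """
<DOC>
<DOCNO>doc-1</DOCNO>
<CATEGORY>sport</CATEGORY>
<TITLE>فوز الفريق</TITLE>
<TEXT>فاز الفريق في المباراة وسجل اللاعب هدفين</TEXT>
</DOC>
<DOC>
<DOCNO>doc-2</DOCNO>
<CATEGORY>sport</CATEGORY>
<TITLE>خسارة الفريق</TITLE>
<TEXT>خسر الفريق المباراة أمام المنتخب</TEXT>
</DOC>
<DOC>
<DOCNO>doc-3</DOCNO>
<CATEGORY>economy</CATEGORY>
<TITLE>أسعار النفط</TITLE>
<TEXT>ارتفعت أسعار النفط في الأسواق العالمية</TEXT>
</DOC>
<DOC>
<DOCNO>doc-4</DOCNO>
<CATEGORY>economy</CATEGORY>
<TITLE>أسعار الفائدة</TITLE>
<TEXT>أعلن البنك عن ارتفاع أسعار الفائدة</TEXT>
</DOC>
"""

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELDS = ("DOCNO", "CATEGORY", "TITLE", "TEXT")

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def field(block, tag):
    """Return the stripped content of <tag>...</tag> in block, or ''."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_trec(raw):
    """Split raw text into one dict per <DOC> block, keyed by lowercase tag."""
    return [{tag.lower(): field(block, tag) for tag in FIELDS}
            for block in DOC_RE.findall(raw)]


def tokenize(text):
    """Extract runs of Arabic letters and drop stop words (no stemming)."""
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def train_nb(labelled_docs):
    """Collect per-class document counts, per-class word counts, vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labelled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        class_total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for token in tokens:
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1)
                                  / (class_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    docs = parse_trec(SAMPLE)
    model = train_nb([(d["category"], tokenize(d["text"])) for d in docs])
    print(predict_nb(model, tokenize("سجل اللاعب هدفا في المباراة")))
```

Training on the "title" field instead of "text" is a one-word change in the last lines, which is a simple way to reproduce the title-versus-text comparison the abstract reports, although four toy documents say nothing about real accuracy.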