A SCALABLE SHALLOW LEARNING APPROACH FOR TAGGING ARABIC NEWS ARTICLES

(Received: 28-Mar.-2020, Revised: 23-Jun.-2020 and 15-Jul.-2020, Accepted: 16-Jul.-2020)
Text classification is the process of automatically tagging a textual document with the most relevant set of labels. The aim of this work is to automatically tag an input document based on its vocabulary features. To achieve this goal, two large datasets were constructed from various Arabic news portals. The first dataset consists of 90k single-labeled articles from four domains (Business, Middle East, Technology and Sports). The second dataset has over 290k multi-tagged articles. The datasets will be made freely available to the Arabic computational linguistics research community. To examine the usefulness of both datasets, we implemented an array of ten shallow learning classifiers. In addition, we implemented an ensemble model that combines the best classifiers in a majority-voting classifier. The performance of the classifiers on the first dataset ranged between 87.7% (AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label, as opposed to single-label, categorization for better classification results. For the multi-labeling task, we used classifiers that are compatible with it, such as Logistic Regression and XGBoost, and tested them on the second, larger dataset. A custom accuracy metric, designed for the multi-labeling task, was developed for performance evaluation alongside the Hamming loss metric. XGBoost proved to be the best multi-labeling classifier, scoring an accuracy of 91.3%, higher than the Logistic Regression score of 87.6%.
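As an illustration of two notions mentioned above, the following is a minimal Python sketch of majority voting over per-classifier predictions and of the standard Hamming loss for multi-label output. It is not the implementation used in this work: the function names, the tie-breaking rule and the toy labels are assumptions made for the example, and the custom accuracy metric is not reproduced here because its definition is not given in this section.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for one document.

    `predictions` holds one predicted label per classifier.
    Assumption: ties go to the label that appears first in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def hamming_loss(true_labels, predicted_labels, label_space):
    """Fraction of (document, label) decisions that are wrong.

    `true_labels` and `predicted_labels` are equal-length lists of label
    sets, one set per document; `label_space` lists all possible labels.
    """
    errors = 0
    for truth, guess in zip(true_labels, predicted_labels):
        # Symmetric difference counts missed labels plus spurious labels.
        errors += len(set(truth) ^ set(guess))
    return errors / (len(true_labels) * len(label_space))


# Toy data only; the label names are taken from the first dataset's domains.
labels = ["Business", "Middle East", "Technology", "Sports"]
truth = [{"Business", "Technology"}, {"Sports"}]
guess = [{"Business"}, {"Sports", "Middle East"}]

print(majority_vote(["Sports", "Business", "Sports"]))
print(hamming_loss(truth, guess, labels))
```

In the toy data, one label is missed in the first document and one spurious label is added in the second, so two of the eight document-label decisions are wrong. A lower Hamming loss is better, with zero meaning that every label decision matches the ground truth.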
Jordanian Journal of Computers and Information Technology (JJCIT), Vol. 06, No. 03, September 2020.