A SCALABLE SHALLOW LEARNING APPROACH FOR TAGGING ARABIC NEWS ARTICLES

(Received: 28-Mar.-2020, Revised: 23-Jun.-2020 and 15-Jul.-2020, Accepted: 16-Jul.-2020)

Authors Leen Al Qadi, Hozayfa El Rifai, Safa Obaid, Ashraf Elnagar

Keywords #Arabic text classification #Single-label classification #Multi-label classification #Arabic datasets #Shallow learning classifiers

Abstract Text classification is the process of automatically tagging a textual document with the most relevant set of labels. The aim of this work is to automatically tag an input document based on its vocabulary features. To achieve this goal, two large datasets have been constructed from various Arabic news portals. The first dataset consists of 90k single-labeled articles from 4 domains (Business, Middle East, Technology and Sports). The second dataset has over 290k multi-tagged articles. The datasets will be made freely available to the Arabic computational linguistics research community. To examine the usefulness of both datasets, we implemented an array of ten shallow learning classifiers. In addition, we implemented an ensemble model that combines the best classifiers in a majority-voting classifier. The accuracy of the classifiers on the first dataset ranged between 87.7% (AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label, as opposed to single-label, categorization for better classification results. We used classifiers that are compatible with multi-labeling tasks, such as Logistic Regression and XGBoost, and tested them on the second, larger dataset. A custom accuracy metric, designed for the multi-labeling task, has been developed for performance evaluation and is reported along with the Hamming loss metric. XGBoost proved to be the best multi-labeling classifier, scoring an accuracy of 91.3%, higher than the Logistic Regression score of 87.6%.
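
As an illustration of two ideas mentioned in the abstract, the following minimal Python sketch shows how majority voting over several classifiers' predictions and the standard Hamming loss could be computed. It is not the authors' implementation: the function names `majority_vote` and `hamming_loss` and the toy inputs are assumptions made for this example, and the paper's custom multi-label accuracy metric is not reproduced here because the abstract does not define it.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine single-label predictions from several classifiers.

    `predictions` is a list with one entry per classifier; each entry is a
    list of predicted labels, one per document. Ties are resolved in favour
    of the label seen first (an arbitrary choice made for this sketch).
    """
    voted = []
    for labels_for_doc in zip(*predictions):
        voted.append(Counter(labels_for_doc).most_common(1)[0][0])
    return voted


def hamming_loss(y_true, y_pred):
    """Fraction of label slots that differ between two binary indicator
    matrices (rows = documents, columns = tags)."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


if __name__ == "__main__":
    # Three hypothetical classifiers, two documents.
    votes = majority_vote([
        ["Sports", "Business"],
        ["Sports", "Technology"],
        ["Business", "Technology"],
    ])
    print(votes)

    # Two documents, three possible tags each.
    loss = hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]])
    print(round(loss, 4))
```

A lower Hamming loss means fewer individual tag assignments were wrong, which makes it a natural companion to an accuracy-style score in the multi-label setting.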
