Supervised classification of languages used by Moroccans in social networks

Authors

  • Otman Moussaoui SIGL Laboratory, ENSA Tetuan, UAE
  • Yacine El Younoussi SIGL Laboratory, ENSA Tetuan, UAE
  • Chaimae Azroumahli SIGL Laboratory, ENSA Tetuan, UAE

Keywords:

Comment, social media, automatic classification, NLP, Morocco Dialect

Abstract

On social networks such as Facebook, users' comments span several languages, so knowing the language of a comment can be very valuable for any further processing. In this paper, we compare the performance of several typical classification approaches applied to our manually annotated dataset, which is composed of Facebook comments written by Moroccan users. The approaches considered are Naive Bayes, Support Vector Machines, K-Nearest Neighbors, Logistic Regression, Gradient Boosting, Random Forest, Decision Trees, and Multi-layer Perceptron. The results show that the Multi-layer Perceptron achieved the highest success rate (86.79%), followed by Logistic Regression (86.71%) and Naive Bayes (85.64%).
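
To make the task concrete, the following is a minimal sketch of supervised language identification with one of the approaches named above, Naive Bayes. It is not the pipeline used in the paper: the abstract does not state which features or label set were used, so the character bigram features, the Laplace smoothing, the three labels, and the toy comments are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageClassifier:
    """Multinomial Naive Bayes over character n-grams with Laplace smoothing.

    Hypothetical helper class written for this illustration.
    """

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram length (assumed, not taken from the paper)
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text, self.n))
        # The vocabulary is every n-gram seen in any training comment.
        self.vocabulary = set()
        for counts in self.ngram_counts.values():
            self.vocabulary.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                if gram in self.vocabulary:  # n-grams never seen in training are skipped
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training comments (invented for this sketch, not from the annotated dataset).
train = [
    ("bonjour tout le monde", "french"),
    ("merci beaucoup pour votre aide", "french"),
    ("je suis très content de vous voir", "french"),
    ("thank you very much for your help", "english"),
    ("this is a really good idea", "english"),
    ("what are you doing this weekend", "english"),
    ("شكرا جزيلا على المساعدة", "arabic"),
    ("هذا الكتاب جميل جدا", "arabic"),
    ("كيف حالك اليوم", "arabic"),
]

texts, labels = zip(*train)
clf = NaiveBayesLanguageClassifier().fit(texts, labels)
print(clf.predict("thank you"))
```

A comparison like the one reported would train each classifier on the same annotated split and measure accuracy on held-out comments; a handful of toy sentences like these cannot reproduce the reported success rates.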

Published

2022-12-31

How to Cite

Moussaoui, O., El Younoussi, Y., & Azroumahli, C. (2022). Supervised classification of languages used by Moroccans in social networks. International Journal of Computer Engineering and Data Science (IJCEDS), 2(4), 1–10. Retrieved from http://ijceds.com/ijceds/article/view/41