Supervised classification of languages used by Moroccans in social networks
Keywords: Comment, social media, automatic classification, NLP, Moroccan dialect
On social networks such as Facebook, users' comments span several languages, so identifying the language of a comment is valuable for any further processing. In this paper, we compare the performance of several standard classification approaches on our manually annotated dataset of Facebook comments written by Moroccan users. The approaches considered are Naive Bayes, Support Vector Machines, K-Nearest Neighbors, Logistic Regression, Gradient Boosting, Random Forest, Decision Trees, and Multi-layer Perceptron. The Multi-layer Perceptron achieved the highest success rate (86.79%), followed by Logistic Regression (86.71%) and Naive Bayes (85.64%).
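To make the task concrete, here is a minimal sketch of one of the listed approaches, Naive Bayes, applied to language identification of short comments. It is not the authors' implementation. The abstract does not state which features were used, so the character bigram features, the class name `NaiveBayesLanguageID`, the helper `char_ngrams`, the label set, and the toy training sentences are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageID:
    """Multinomial Naive Bayes over character n-grams with add-alpha smoothing.

    Hypothetical illustration; feature choice and hyperparameters are assumptions.
    """

    def __init__(self, n=2, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.doc_counts = Counter()               # label -> number of comments
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy comments invented for this sketch; they are not taken from the paper's dataset.
train_texts = [
    "thank you very much for the help",
    "good morning to all my friends",
    "this is a great picture",
    "merci beaucoup pour votre aide",
    "bonjour à tous mes amis",
    "c'est une très belle photo",
    "شكرا جزيلا على المساعدة",
    "صباح الخير يا أصدقائي",
    "هذه صورة جميلة جدا",
]
train_labels = ["en", "en", "en", "fr", "fr", "fr", "ar", "ar", "ar"]

model = NaiveBayesLanguageID(n=2).fit(train_texts, train_labels)

for comment in ["good morning my friends", "merci pour la photo", "شكرا على الصورة"]:
    print(comment, "->", model.predict(comment))
```

A real comparison like the one described above would train each classifier on the same annotated comments and report accuracy on held-out data; this sketch only shows how a single classifier maps a comment to a language label.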
Copyright (c) 2022 Otman Moussaoui, Yacine El Younoussi, Chaimae Azroumahli
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright on any article in the International Journal of Computer Engineering and Data Science (IJCEDS) is retained by the author(s) under the Creative Commons license, which permits unrestricted use, distribution, and reproduction provided the original work is properly cited.
Authors grant IJCEDS a license to publish the article and identify IJCEDS as the original publisher.
Authors also grant any third party the right to use, distribute and reproduce the article in any medium, provided the original work is properly cited.