Multilingual Email Spam Detection: Using Cross-lingual Transfer Learning
DOI: https://doi.org/10.37859/coscitech.v6i3.10107
Abstract
Text classification in Indonesian often suffers from a scarcity of labeled data. To address this, this research adapts the pre-trained language model BERT-base-multilingual-cased, which was trained on a large multilingual corpus. The strategy has two stages: first, the model is fine-tuned on a large English-language spam dataset; second, the resulting model is further fine-tuned on a much smaller Indonesian-language dataset. Quantitative evaluation shows strong and consistent performance in both languages. On the English dataset, the model reached an accuracy of 0.9738 and an F1-score of 0.9436. On the Indonesian dataset, it achieved an accuracy of 0.9492 and an F1-score of 0.9494. The comparable performance across the two languages, despite the much smaller Indonesian dataset, indicates that the knowledge acquired from the source language (English) transfers effectively to the same classification task in the target language (Indonesian). This research demonstrates how transfer learning can help bridge the data resource gap and has implications for developing NLP applications for low-resource languages.
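The abstract reports accuracy and F1-score side by side, and the two can diverge when the classes are imbalanced. The following is a minimal sketch of how both metrics are derived from the same predictions. It assumes a binary F1 with spam as the positive class; the abstract does not state which F1 averaging was used. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper's data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and binary F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels (1 = spam, 0 = not spam): 3 spam out of 10 messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the many correctly classified non-spam messages lift accuracy above F1, which only looks at the spam class. A gap of that kind, as in the English results, is consistent with an imbalanced dataset, though the abstract does not give the class distribution.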