Optimizing an email spam detection algorithm with BERT-MI and a dense network
DOI: https://doi.org/10.37859/coscitech.v6i2.9460

Abstract
Email spam detection is a critical challenge in maintaining the security and efficiency of digital communication. This research proposes and evaluates an optimized pipeline for email spam detection by integrating Bidirectional Encoder Representations from Transformers (BERT) for feature extraction, Mutual Information (MI) for dimensionality-reducing feature selection, and a dense neural network for classification. The Lingspam dataset, consisting of 2893 emails (2412 ham and 481 spam), was used in the experiments with an 80% training and 20% testing split. Text features were extracted with BERT (bert-base-uncased), yielding a 768-dimensional embedding, which MI then reduced to the 200 most relevant features. A dense neural network with a 256-128-64-32-1 neuron architecture was trained using the Adam optimizer and binary cross-entropy loss, with early stopping and class weights to handle class imbalance. Evaluation on the test data showed very high performance: an accuracy of 99.14%, precision of 0.9596, recall of 0.9896, F1-score of 0.9744, and ROC-AUC of 0.9995. These results indicate that combining BERT-MI with a dense network can match the accuracy of more complex methods while using a simpler, more efficient architecture.
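The pipeline in the abstract can be sketched in a few lines. The sketch below is a minimal, hedged illustration, not the authors' code: random 768-dimensional vectors stand in for bert-base-uncased embeddings (extracting real ones requires the `transformers` library), the class ratio and the weak "spam signal" in the data are invented for the demo, and scikit-learn's `MLPClassifier` approximates the 256-128-64-32-1 dense network (its output layer provides the final sigmoid unit, and its Adam solver with log-loss mirrors the paper's optimizer and loss; it has no class-weight option, so that detail is omitted).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)

# Stand-in for BERT output: 600 "emails" x 768 dims, roughly the
# Lingspam spam rate (~17%). Real embeddings would come from
# bert-base-uncased via the transformers library.
n_samples, n_bert_dims, k_selected = 600, 768, 200
X = rng.normal(size=(n_samples, n_bert_dims))
y = (rng.random(n_samples) < 0.17).astype(int)
X[y == 1, :50] += 1.0  # inject a weak synthetic signal into spam rows

# Mutual Information keeps the 200 most informative of 768 dimensions.
selector = SelectKBest(mutual_info_classif, k=k_selected)
X_sel = selector.fit_transform(X, y)

# 80/20 stratified split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=0
)

# Dense 256-128-64-32 hidden stack + sigmoid output, Adam optimizer,
# log-loss (binary cross-entropy), early stopping.
clf = MLPClassifier(
    hidden_layer_sizes=(256, 128, 64, 32),
    solver="adam",
    early_stopping=True,
    max_iter=200,
    random_state=0,
)
clf.fit(X_tr, y_tr)
print(X_sel.shape, round(clf.score(X_te, y_te), 3))
```

On real BERT embeddings the MI step would be applied the same way; only the stand-in feature matrix changes.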