Optimizing an email spam detection algorithm with BERT-MI and a dense network

Authors

  • Florentina Yuni Arini Universitas Negeri Semarang

DOI:

https://doi.org/10.37859/coscitech.v6i2.9460
Keywords: email spam detection, BERT, mutual information, dense neural network, lingspam

Abstract

Email spam detection is a critical challenge in maintaining the security and efficiency of digital communication. This research proposes and evaluates an optimized email spam detection pipeline that integrates Bidirectional Encoder Representations from Transformers (BERT) for feature extraction, Mutual Information (MI) for feature selection to reduce dimensionality, and a dense neural network for classification. The Lingspam dataset, consisting of 2893 emails (2412 ham and 481 spam), was used in the experiments with an 80% training and 20% testing split. Text features were extracted with BERT (bert-base-uncased), yielding a 768-dimensional embedding that was then reduced to the 200 most relevant features using MI. A dense neural network with a 256-128-64-32-1 neuron architecture was trained using the Adam optimizer and a binary cross-entropy loss function, with early stopping to limit overfitting and class weights to handle class imbalance. On the test data, the model achieved an accuracy of 99.14%, precision of 0.9596, recall of 0.9896, F1-score of 0.9744, and ROC-AUC of 0.9995. These results indicate that combining BERT-MI with a dense network can achieve accuracy comparable to more complex methods, with the potential for a simpler and more efficient architecture.
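As a consistency check on the reported metrics, the sketch below recomputes accuracy, precision, recall, and F1-score from a confusion matrix. The counts (95 true positives, 1 false negative, 4 false positives, 479 true negatives) are not given in the abstract; they are an inferred, hypothetical set of 579 test emails (about 20% of 2893) that happens to reproduce the reported figures after rounding. ROC-AUC depends on the ranked scores and cannot be recovered from these counts.

```python
# Hypothetical test-set confusion matrix (spam = positive class).
# These counts are inferred for illustration; the abstract does not report them.
TP, FN, FP, TN = 95, 1, 4, 479


def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted spam that is really spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, only five of the 579 test emails are misclassified, and four of those five are ham flagged as spam, which is why precision is lower than recall.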

Published

2025-09-13

How to Cite

Florentina Yuni Arini. (2025). Optimasi algoritma deteksi spam email dengan BERT-MI dan jaringan dense. Jurnal CoSciTech (Computer Science and Information Technology), 6(2), 319–328. https://doi.org/10.37859/coscitech.v6i2.9460