Lead Scoring Prediction for Sales Optimization Using Random Forest and the SMOTE Technique

Authors

  • Daffa Pratama Putra Universitas Sriwijaya
  • Dimas Agil Kusuma Universitas Sriwijaya
  • M. Rizki Al Akbar Universitas Sriwijaya
  • Ali Ibrahim Universitas Sriwijaya
  • Fathoni Fathoni Universitas Sriwijaya

DOI:

https://doi.org/10.37859/jf.v16i1.11292
Keywords: Lead Scoring, Random Forest, SMOTE, class imbalance, Customer Relationship Management

Abstract

Accurate lead scoring systems have become a strategic necessity for organizations operating in data-driven marketing environments, as they enable systematic identification of high-value customer prospects to maximize sales conversion efficiency. A fundamental challenge confronting conventional classification models is the class imbalance inherent in real-world marketing data, which induces majority-class bias and substantially reduces sensitivity toward minority-class prospects. This study proposes a Random Forest (RF)-based lead scoring prediction model integrated with the Synthetic Minority Over-sampling Technique (SMOTE) to address this limitation systematically. The dataset employed is the Lead Scoring Dataset from Kaggle, comprising 9,240 customer prospect records from an educational company with a class imbalance ratio of 1.59:1. Preprocessing included missing value treatment, removal of attributes exceeding 40% data loss, mode-based imputation, and categorical feature encoding. Following an 80:20 stratified split, SMOTE was applied exclusively to the training set to produce a balanced class distribution and prevent data leakage. The RF model was configured with n_estimators = 100, max_features = 'sqrt', and class_weight = 'balanced'. The proposed RF+SMOTE model achieved accuracy of 88.80%, precision of 86.44%, recall of 84.13%, F1-Score of 85.27%, and AUC-ROC of 0.9453, outperforming the baseline across four of five evaluation metrics. The most notable improvement was observed in recall, with a gain of 1.26 percentage points. Stratified 5-Fold Cross-Validation confirmed robust generalization capability, with AUC-ROC values consistently ranging between 94% and 95%. These findings demonstrate that the hybrid RF+SMOTE approach effectively enhances high-potential prospect detection while maintaining overall model stability for real-world Customer Relationship Management (CRM) deployment.


References

N. Ahmad, M. J. Awan, H. Nobanee, A. M. Zain, A. Naseem, and A. Mahmoud, “Customer Personality Analysis for Churn Prediction Using Hybrid Ensemble Models and Class Balancing Techniques,” IEEE Access, vol. 12, pp. 1865–1879, 2024, doi: 10.1109/ACCESS.2023.3334641.

J. Lin, “Application of machine learning in predicting consumer behavior and precision marketing,” PLoS One, vol. 20, no. 5, pp. 1–12, 2025, doi: 10.1371/journal.pone.0321854.

L. González-Flores, J. Rubiano-Moreno, and G. Sosa-Gómez, “The relevance of lead prioritization: a B2B lead scoring model based on machine learning,” Front. Artif. Intell., vol. 8, 2025, doi: 10.3389/frai.2025.1554325.

A. Yocupicio-Zazueta, A. Brau-Avila, F. Cirett-Galán, and M. Valenzuela-Galván, “Design and Deployment of ML in CRM to Identify Leads,” Appl. Artif. Intell., vol. 38, no. 1, 2024, doi: 10.1080/08839514.2024.2376978.

M. Mujahid et al., “Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering,” J. Big Data, vol. 11, no. 1, Dec. 2024, doi: 10.1186/s40537-024-00943-4.

M. Altalhan, A. Algarni, and M. Turki-Hadj Alouane, “Imbalanced Data Problem in Machine Learning: A Review,” IEEE Access, vol. 13, pp. 13686–13699, 2025, doi: 10.1109/ACCESS.2025.3531662.

A. Manzoor, M. Atif Qureshi, E. Kidney, and L. Longo, “A Review on Machine Learning Methods for Customer Churn Prediction and Recommendations for Business Practitioners,” IEEE Access, vol. 12, pp. 70434–70463, 2024, doi: 10.1109/ACCESS.2024.3402092.

E. F. Agyemang et al., “Addressing Class Imbalance Problem in Health Data Classification: Practical Application From an Oversampling Viewpoint,” Appl. Comput. Intell. Soft Comput., vol. 2025, no. 1, 2025, doi: 10.1155/acis/1013769.

Z. Zheng, “Financial Risk Early Warning Model Combining SMOTE and Random Forest for Internet Finance Companies,” J. Cases Inf. Technol., vol. 26, no. 1, 2024, doi: 10.4018/JCIT.356504.

Husain et al., “SMOTE vs. SMOTEENN: A Study on the Performance of Resampling Algorithms for Addressing Class Imbalance in Regression Models,” Algorithms, vol. 18, no. 1, Jan. 2025, doi: 10.3390/a18010037.

I. Aruleba and Y. Sun, “Effective Credit Risk Prediction Using Ensemble Classifiers With Model Explanation,” IEEE Access, vol. 12, pp. 115015–115025, 2024, doi: 10.1109/ACCESS.2024.3445308.

B. Amirshahi and S. Lahmiri, “Bankruptcy prediction using optimal ensemble models under balanced and imbalanced data,” Expert Syst., vol. 41, no. 8, Aug. 2024, doi: 10.1111/exsy.13599.

S. Gholampour, “Impact of Nature of Medical Data on Machine and Deep Learning for Imbalanced Datasets: Clinical Validity of SMOTE Is Questionable,” Mach. Learn. Knowl. Extr., vol. 6, no. 2, pp. 827–841, Jun. 2024, doi: 10.3390/make6020039.

N. S. Thomas and S. Kaliraj, “An Improved and Optimized Random Forest Based Approach to Predict the Software Faults,” SN Comput. Sci., vol. 5, no. 5, Jun. 2024, doi: 10.1007/s42979-024-02764-x.

J. Lyu, J. Yang, Z. Su, and Z. Zhu, “LD-SMOTE: A Novel Local Density Estimation-Based Oversampling Method for Imbalanced Datasets,” Symmetry (Basel), vol. 17, no. 2, Feb. 2025, doi: 10.3390/sym17020160.

Published

2026-04-30