The Effect of Data Aggregation on Sentiment Classification for Limited Datasets Using the SGD Classifier
Abstract
Social media, especially Twitter (now X), is a rich source of data for sentiment analysis. However, limited labeled data is a major obstacle to applying machine learning, particularly when fast and accurate sentiment analysis is required. This research applies data aggregation to expand the training dataset and tests a range of preprocessing steps: cleaning, case folding, normalization, stemming, and lexicon-based methods. Classification is performed with a Stochastic Gradient Descent (SGD) classifier, with text represented by word embeddings generated from the FastText language model. Lexicon-based preprocessing, particularly the handling of emoji and emoticons, has a significant impact once data is added, because it captures additional emotion and context that conventional text analysis often overlooks. Experimental results show that adding data and optimizing preprocessing raised the F1 score from a baseline of 40% to 52.13%, surpassing the 51.28% reached by the organizer. These findings underline the importance of data aggregation, preprocessing optimization, and parameter tuning with grid search for improving text sentiment classification on limited datasets.
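The lexicon-based preprocessing step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `EMOJI_LEXICON` entries, the `preprocess` function name, and the regular expressions are all hypothetical, and normalization and stemming are left out for brevity.

```python
import re

# Hypothetical mini-lexicon mapping emoji/emoticons to sentiment words.
# These entries are illustrative assumptions, not the lexicon used in the study.
EMOJI_LEXICON = {
    "😊": "happy",
    "😡": "angry",
    "😢": "sad",
    ":)": "happy",
    ":(": "sad",
}

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
NON_WORD = re.compile(r"[^a-z0-9\s]")


def preprocess(text: str) -> str:
    """Clean a tweet, expand emoji via the lexicon, and case-fold it."""
    # Cleaning: drop URLs and user mentions.
    text = URL_OR_MENTION.sub(" ", text)
    # Lexicon step: replace each emoji/emoticon with its sentiment word
    # before punctuation is stripped, so the emotional signal is kept.
    for symbol, word in EMOJI_LEXICON.items():
        text = text.replace(symbol, f" {word} ")
    # Case folding, then removal of remaining punctuation and symbols
    # (this simple pattern also drops non-ASCII letters).
    text = NON_WORD.sub(" ", text.lower())
    # Collapse repeated whitespace.
    return " ".join(text.split())
```

Replacing the symbols before punctuation removal is the key design choice here: if cleaning ran first, emoticons such as `:(` would be deleted and the sentiment cue they carry would never reach the embedding stage.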

