OPTIMIZATION OF THE K-MEANS ALGORITHM WITH A DIMENSIONALITY REDUCTION METHOD FOR BIG DATA CLUSTERING IN CLOUD COMPUTING ARCHITECTURE
DOI: https://doi.org/10.37859/seis.v5i1.7616
Abstract
In the era of big data, clustering is a major challenge because of the complexity and sheer volume of the data. K-means is one of the most widely used clustering techniques because of its simplicity, but it struggles with high-dimensional, large-volume data. This study proposes optimizing the K-means algorithm with the Principal Component Analysis (PCA) dimensionality reduction method to improve the efficiency and accuracy of big data clustering in a cloud computing architecture. The method is tested on the KDD Cup 1999 dataset. The dataset is pre-processed and reduced in dimension with PCA, after which K-means clustering is applied. The clustering results are evaluated using the Silhouette Score and the Davies-Bouldin Index. The implementation is carried out in the Google Colab environment to make use of cloud computing resources. The results show that dimensionality reduction using PCA significantly reduces computational complexity and improves clustering quality. The method is therefore effective for clustering big data and offers an efficient solution for data clustering in a cloud computing architecture.
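The pipeline described in the abstract (pre-processing, PCA, K-means, then evaluation with the Silhouette Score and Davies-Bouldin Index) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the synthetic `make_blobs` data stands in for the KDD Cup 1999 dataset, and the helper names (`cluster_with_pca`, `evaluate`) and all parameter values (5 components, 4 clusters, standardization as the pre-processing step) are assumptions chosen only for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cluster_with_pca(X, n_components, n_clusters, seed=0):
    """Standardize the features, project onto principal components, run K-means.

    Hypothetical helper: standardization is assumed here as the
    pre-processing step; the paper's actual pre-processing may differ.
    """
    reducer = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=seed),
    )
    X_reduced = reducer.fit_transform(X)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(X_reduced)
    return X_reduced, labels


def evaluate(X, labels, sample_size=1000, seed=0):
    """Return (silhouette score, Davies-Bouldin index) for a clustering.

    The silhouette score is computed on a random subsample because its
    cost grows quadratically with the number of points.
    """
    sil = silhouette_score(
        X, labels, sample_size=min(sample_size, len(X)), random_state=seed
    )
    dbi = davies_bouldin_score(X, labels)
    return sil, dbi


# Synthetic stand-in for a high-dimensional dataset (assumption, not KDD Cup 1999).
X, _ = make_blobs(n_samples=2000, n_features=40, centers=4, random_state=0)

X_reduced, labels = cluster_with_pca(X, n_components=5, n_clusters=4)
sil, dbi = evaluate(X_reduced, labels)

print(f"reduced shape: {X_reduced.shape}")
print(f"silhouette score (higher is better): {sil:.3f}")
print(f"Davies-Bouldin index (lower is better): {dbi:.3f}")
```

Running K-means on the reduced matrix rather than on the original 40 features shrinks the cost of every distance computation, which is the efficiency argument the abstract makes. In practice the number of components would be chosen from the explained variance of the real data rather than fixed in advance.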
License
Copyright Notice
An author who publishes in the Journal of Software Engineering and Information System (SEIS) agrees to the following terms:
- Author retains the copyright and grants the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution-ShareAlike 4.0 License, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Author is able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book) with the acknowledgement of its initial publication in this journal.
- Author is permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of the published work (See The Effect of Open Access).
Read more about the Creative Commons Attribution-ShareAlike 4.0 License here: https://creativecommons.org/licenses/by-sa/4.0/.






