OPTIMIZATION OF THE K-MEANS ALGORITHM WITH A DIMENSIONALITY REDUCTION METHOD FOR BIG DATA CLUSTERING IN A CLOUD COMPUTING ARCHITECTURE

Authors

  • Bayu Anugerah Putra, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Harun Mukhtar, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Elsi Titasari Br Bangun, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Alris Gusnanda, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Adila Maisyarah, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Muhammad Irgi Kurniawan, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Raditya Pradipa, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau
  • Zurrahman Muhammad Ali, Fakultas Ilmu Komputer, Universitas Muhammadiyah Riau

DOI:

https://doi.org/10.37859/seis.v5i1.7616
Keywords: K-Means Optimization, Dimensionality Reduction, Principal Component Analysis (PCA), Big Data Clustering, Cloud Computing Architecture, KDD Cup 1999, Clustering Evaluation

Abstract

In the era of big data, clustering has become a major challenge because of the complexity and sheer volume of the data involved. K-means is one of the most widely used clustering algorithms owing to its simplicity, but it struggles with high-dimensional, large-volume data. This study proposes optimizing K-means with Principal Component Analysis (PCA) as a dimensionality reduction step, with the aim of improving the efficiency and accuracy of big data clustering in a cloud computing architecture. The method is tested on the KDD Cup 1999 dataset: the data are pre-processed, reduced in dimension with PCA, and then clustered with K-means. The clustering results are evaluated using the Silhouette Score and the Davies-Bouldin Index. The implementation runs in the Google Colab environment to make use of cloud computing resources. The results show that dimensionality reduction with PCA significantly reduces computational complexity and improves clustering quality. The method is therefore an effective and efficient approach to clustering big data in a cloud computing architecture.
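
The pipeline described in the abstract (pre-processing, PCA, K-means, then evaluation with the Silhouette Score and Davies-Bouldin Index) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn is available (as it typically is in Google Colab), uses a small synthetic dataset from `make_blobs` in place of KDD Cup 1999, and the helper name `cluster_and_score`, the cluster count, and the number of retained components are all hypothetical choices made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler


def cluster_and_score(features, n_clusters, n_components=None, seed=0):
    """Standardize, optionally reduce with PCA, run K-means, and score the result.

    Hypothetical helper for illustration; parameter values are assumptions.
    """
    # Pre-processing: put every feature on a comparable scale.
    data = StandardScaler().fit_transform(features)

    # Optional dimensionality reduction: keep the leading principal components.
    if n_components is not None:
        data = PCA(n_components=n_components, random_state=seed).fit_transform(data)

    # K-means clustering in the (possibly reduced) feature space.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(data)

    # Both indices are computed in the same space the clustering ran in.
    return {
        "n_features": data.shape[1],
        "silhouette": silhouette_score(data, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(data, labels),  # lower is better
    }


# Synthetic stand-in data (assumption): 600 points, 20 features, 4 groups.
X, _ = make_blobs(n_samples=600, n_features=20, centers=4, random_state=0)

baseline = cluster_and_score(X, n_clusters=4)                 # K-means only
reduced = cluster_and_score(X, n_clusters=4, n_components=5)  # PCA + K-means

print("K-means only :", baseline)
print("PCA + K-means:", reduced)
```

Because each run is scored in the space it was clustered in, the two sets of indices are not strictly comparable; a stricter comparison would score both label assignments against the same standardized features. On data the size of KDD Cup 1999, the `sample_size` argument of `silhouette_score` is one way to keep the evaluation tractable.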

Published

2025-01-24

How to Cite

Putra, B. A., Mukhtar, H., Br Bangun, E. T., Gusnanda, A., Maisyarah, A., Kurniawan, M. I., Pradipa, R., & Ali, Z. M. (2025). OPTIMISASI ALGORITMA K-MEANS DENGAN METODE REDUKSI DIMENSI UNTUK PENGELOMPOKAN BIG DATA DALAM ARSITEKTUR CLOUD COMPUTING. Journal of Software Engineering and Information System (SEIS), 5(1), 1–8. https://doi.org/10.37859/seis.v5i1.7616

Section

Articles