Application of K-Means and Decision Tree for Disease Prediction Using Data Mining Approach
Keywords:
K-Means Clustering, Data Mining, Decision Tree
Abstract
This study aims to analyze the distribution patterns of patient diseases using a data mining approach at UPTD Puskesmas Pakkat. The dataset consists of secondary data from 4,633 patients collected between January 2022 and December 2023, obtained from digital medical records, with variables including age, gender, and 22 disease diagnosis categories. The K-Means Clustering method was employed to identify disease grouping patterns based on patient characteristics. The optimal number of clusters was determined using the Silhouette Score, with the best value of 0.5556 at K=6. Cluster quality was further evaluated using the Davies-Bouldin Index (DBI) with a value of 0.6722, indicating good cluster separation. To support the classification process, the Decision Tree algorithm was applied to predict cluster membership for new patient data. Model evaluation was conducted using a train-test split scheme and k-fold cross-validation to enhance reliability and minimize the risk of overfitting. The results indicate distinct disease patterns across age groups, where infectious diseases such as acute respiratory infections (ARI) and diarrhea dominate in children, while non-communicable diseases such as hypertension and diabetes are more prevalent among adults and the elderly. This study contributes by integrating clustering and classification methods and provides data-driven epidemiological insights that can support decision-making in primary healthcare services.
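The pipeline described in the abstract (choosing K by Silhouette Score, checking separation with the Davies-Bouldin Index, then training a Decision Tree to predict cluster membership under a train-test split and k-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration using scikit-learn on synthetic data, not the authors' implementation: the feature encoding, the candidate range of K, the 80/20 split, the 5 folds, the tree depth, and the helper name `pick_k` are all assumptions, and the scores it produces will not match the values reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the study's patient records).
# Columns mirror the variables named in the abstract: age, gender, diagnosis.
rng = np.random.default_rng(0)
n_patients = 600
age = rng.integers(0, 90, size=n_patients)
gender = rng.integers(0, 2, size=n_patients)
# Simplification: the 22 diagnosis categories are used as a numeric code here;
# the encoding actually used in the study is not stated in the abstract.
diagnosis = rng.integers(0, 22, size=n_patients)
features = np.column_stack([age, gender, diagnosis]).astype(float)
X = StandardScaler().fit_transform(features)


def pick_k(X, k_values):
    """Hypothetical helper: return the K with the highest Silhouette Score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# 1. Choose K by Silhouette Score (candidate range 2..8 is an assumption).
best_k, sil_scores = pick_k(X, range(2, 9))

# 2. Fit the final clustering and evaluate separation with DBI (lower is better).
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
dbi = davies_bouldin_score(X, labels)

# 3. Train a Decision Tree to predict cluster membership for unseen patients.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

# 4. k-fold cross-validation (5 folds assumed) as a check against overfitting.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0), X, labels, cv=5
)

print(f"best K = {best_k}, silhouette = {sil_scores[best_k]:.4f}, DBI = {dbi:.4f}")
print(f"test accuracy = {test_accuracy:.4f}, mean CV accuracy = {cv_scores.mean():.4f}")
```

Because the cluster labels are themselves derived from the same features the tree is trained on, the classifier's accuracy mainly indicates how well simple decision rules reproduce the cluster boundaries, which is what makes it usable for assigning new patients to an existing cluster.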
License
Copyright (c) 2026 Riah Ginting, Fernando H Sinaga, Rianto Sitanggang, Ivan Elisabeth Purba, Aprima A Matondang

This work is licensed under a Creative Commons Attribution 4.0 International License.











