Comparison of selected similarity measures for hierarchical clustering of categorical data

Title: Comparison of selected similarity measures for hierarchical clustering of categorical data
Author: Abdullayev, Shahsuvar
Thesis type: Diploma thesis
Supervisor: Šulc, Zdeněk
Opponents: Čabla, Adam
Language: English
Abstract:
The aim of this thesis is to examine and compare selected similarity measures for the hierarchical clustering of categorical data with variables that have more than two categories. Many of the categorical data clustering methods have not been researched well, because many of them are still under development. The analytical part of the thesis compares the methods on generated data sets that contain variables with different numbers of categories. The study compares the single, complete, and average linkage methods, which are hierarchical clustering algorithms that can work on categorical data. For each selected similarity measure, the resulting clusters were evaluated with internal validity measures. Moreover, the generated data sets were evaluated with respect to the discretization method, the difficulty structure, and the number of categories in the variables. The results show which similarity measures produced the best-quality clusters according to the internal criteria coefficients. Based on the results of the data analysis, it cannot be said that the new approaches to the cluster analysis of categorical data achieve consistently better results in all cases. It is therefore advisable to try several methods and to decide, on the basis of the evaluation criteria, which method is the most suitable for a given data set.
Keywords: hierarchical cluster analysis; similarity measures; categorical data; internal evaluation criteria; nomclust package; data generator
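
As a rough illustration of the kind of procedure the abstract describes, the following Python sketch clusters a small made-up categorical data set with single, complete, and average linkage. It assumes the simple matching (overlap) dissimilarity, chosen here only for illustration; the abstract does not name the similarity measures that were compared. The keywords mention the nomclust package, whose interface is not reproduced here. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def simple_matching_dissimilarity(a, b):
    """Share of variables on which two objects take different categories.

    Assumption: this overlap-based measure is only an illustrative choice.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def linkage_distance(c1, c2, dist, method):
    """Distance between two clusters, each given as a list of row indices."""
    pairwise = [dist[i][j] for i in c1 for j in c2]
    if method == "single":      # closest pair of objects
        return min(pairwise)
    if method == "complete":    # farthest pair of objects
        return max(pairwise)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerative(rows, n_clusters, method="average"):
    """Naive agglomerative clustering; returns clusters as sorted index lists."""
    n = len(rows)
    dist = [[simple_matching_dissimilarity(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per object
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(clusters[p[0]], clusters[p[1]],
                                           dist, method),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: three nominal variables with three or four categories.
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]

results = {m: agglomerative(rows, 2, m) for m in ("single", "complete", "average")}
for method, clusters in results.items():
    print(method, clusters)
```

On real data the three linkage methods can produce different partitions, which is why the resulting clusters would then be compared with internal evaluation criteria.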

Study information

Study programme / field: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's degree programme
Degree awarded: Ing.
Degree-granting institution: Vysoká škola ekonomická v Praze
Faculty: Fakulta informatiky a statistiky
Department: Katedra statistiky a pravděpodobnosti

Submission and defense information

Date of assignment: 2 November 2021
Date of submission: 5 December 2022
Date of defense: 30 January 2023
Identifier in InSIS: https://insis.vse.cz/zp/78614/podrobnosti
