Comparison of selected similarity measures for hierarchical clustering of categorical data
Title: | Comparison of selected similarity measures for hierarchical clustering of categorical data |
---|---|
Author: | Abdullayev, Shahsuvar |
Thesis type: | Diploma thesis |
Supervisor: | Šulc, Zdeněk |
Opponent: | Čabla, Adam |
Language: | English |
Abstract: | The aim of this thesis is to examine and compare selected similarity measures for the hierarchical clustering of categorical data with variables that have more than two categories. Many of the categorical data clustering methods have not been researched well because many of them are still under development. The analytical part of the thesis compares the methods on generated data sets that contain variables with different numbers of categories. The study compares the single, complete, and average linkage methods, which are hierarchical clustering algorithms that can work on categorical data. For each selected similarity measure, the resulting clusters were evaluated with internal validity measures. Moreover, the generated data sets were evaluated according to the discretization method, difficulty structure, and the number of categories in the variables. The results show which similarity measures produced the best-quality clusters according to the internal evaluation criteria. Based on the results of the data analysis, it cannot be said that the new approaches to the cluster analysis of categorical data achieve consistently better results in all cases. It is therefore advisable to try several methods and decide, on the basis of the evaluation criteria, which one is most suitable for a given data set. |
Keywords: | hierarchical cluster analysis; similarity measures; categorical data; internal evaluation criteria; nomclust package; data generator |
Study information
Study programme / field: | Economic Data Analysis/Data Analysis and Modeling |
---|---|
Type of study programme: | Master's degree programme |
Degree awarded: | Ing. |
Degree-granting institution: | Prague University of Economics and Business (Vysoká škola ekonomická v Praze) |
Faculty: | Faculty of Informatics and Statistics |
Department: | Department of Statistics and Probability |
Submission and defense information
Thesis assignment date: | 2 November 2021 |
---|---|
Submission date: | 5 December 2022 |
Defense date: | 30 January 2023 |
Identifier in InSIS: | https://insis.vse.cz/zp/78614/podrobnosti |