Comparison of selected similarity measures for hierarchical clustering of categorical data
Thesis title: | Comparison of selected similarity measures for hierarchical clustering of categorical data |
---|---|
Author: | Abdullayev, Shahsuvar |
Thesis type: | Diploma thesis |
Supervisor: | Šulc, Zdeněk |
Opponents: | Čabla, Adam |
Thesis language: | English |
Abstract: | The aim of this thesis is to examine and compare selected similarity measures for the hierarchical clustering of categorical data with variables that have more than two categories. Many of the categorical data clustering methods have not been researched well, because many of them are still in the development phase. The analytical part of the thesis compares the methods on generated data sets that contain variables with different numbers of categories. The study compares the single, complete, and average linkage methods, which are hierarchical clustering algorithms that can work on categorical data. For each selected similarity measure, the resulting clusters were evaluated with internal validity measures. Moreover, the generated data sets were evaluated based on the discretization method, difficulty structure, and the number of categories in the variables. The results show which similarity measures produced the best-quality clusters according to the internal evaluation criteria. Based on the data analysis results, it cannot be said that the new approaches to the cluster analysis of categorical data achieve consistently better results in all cases. It is therefore advisable to try several methods and to decide, on the basis of the evaluation criteria, which method is the most suitable for a given data set. |
Keywords: | hierarchical cluster analysis; similarity measures; categorical data; internal evaluation criteria; nomclust package; data generator |
Information about study
Study programme: | Economic Data Analysis/Data Analysis and Modeling |
---|---|
Type of study programme: | Master's study programme |
Assigned degree: | Ing. |
Institutions assigning academic degree: | Vysoká škola ekonomická v Praze |
Faculty: | Faculty of Informatics and Statistics |
Department: | Department of Statistics and Probability |
Information on submission and defense
Date of assignment: | 2. 11. 2021 |
---|---|
Date of submission: | 5. 12. 2022 |
Date of defense: | 30. 1. 2023 |
Identifier in the InSIS system: | https://insis.vse.cz/zp/78614/podrobnosti |