Comparison of selected similarity measures for hierarchical clustering of categorical data

Thesis title: Comparison of selected similarity measures for hierarchical clustering of categorical data
Author: Abdullayev, Shahsuvar
Thesis type: Diploma thesis
Supervisor: Šulc, Zdeněk
Opponents: Čabla, Adam
Thesis language: English
Abstract:
The aim of this thesis is to examine and compare selected similarity measures for the hierarchical clustering of categorical data with variables that have more than two categories. Many of the categorical data clustering methods have not been researched well, because many of them are still under development. The analytical part of the thesis compares the methods on generated data sets that contain variables with different numbers of categories. The study compares single, complete, and average linkage, which are hierarchical clustering methods that can work on categorical data. For each selected similarity measure, the resulting clusters were evaluated with internal validity measures. Moreover, the results on the generated data sets were evaluated with respect to the discretization method, difficulty structure, and the number of categories in the variables. The results show which similarity measures produced the best-quality clusters according to the internal evaluation criteria. Based on the data analysis results, it cannot be said that the new approaches to the cluster analysis of categorical data achieve consistently better results in all cases. It is therefore advisable to try several methods and to decide, on the basis of evaluation criteria, which method is the most suitable for a given data set.
Keywords: hierarchical cluster analysis; similarity measures; categorical data; internal evaluation criteria; nomclust package; data generator
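
Illustration: the abstract describes clustering categorical objects under single, complete, and average linkage. The following is a minimal sketch of that idea in pure Python, not the thesis's implementation (the keywords point to the nomclust package instead). It assumes the simple matching coefficient as the dissimilarity, which is a common baseline and not necessarily one of the measures the thesis selected. The function names and the toy data set are hypothetical.

```python
from itertools import combinations


def simple_matching(a, b):
    """Dissimilarity: share of variables on which two objects disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def linkage_distance(c1, c2, dist, method):
    """Distance between two clusters given as lists of row indices."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if method == "single":      # closest pair of objects
        return min(pairs)
    if method == "complete":    # farthest pair of objects
        return max(pairs)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(rows, k, method="average"):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    n = len(rows)
    dist = [[simple_matching(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # combinations yields a < b, so deleting index b keeps a valid
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(
                clusters[p[0]], clusters[p[1]], dist, method),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data: three nominal variables, each with three categories.
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]

for method in ("single", "complete", "average"):
    print(method, agglomerate(rows, 2, method))
```

Swapping simple_matching for another dissimilarity function is the only change needed to compare similarity measures under the same linkage, which mirrors the kind of comparison the abstract describes.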

Information about study

Study programme: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's study programme
Assigned degree: Ing.
Institutions assigning academic degree: Vysoká škola ekonomická v Praze
Faculty: Faculty of Informatics and Statistics
Department: Department of Statistics and Probability

Information on submission and defense

Date of assignment: 2. 11. 2021
Date of submission: 5. 12. 2022
Date of defense: 30. 1. 2023
Identifier in the InSIS system: https://insis.vse.cz/zp/78614/podrobnosti
