Thesis title: |
Effect of outliers on hierarchical cluster analysis of qualitative data |
Author: |
Çırak, Elif |
Thesis type: |
Diploma thesis |
Supervisor: |
Cibulková, Jana |
Opponents: |
Šulc, Zdeněk |
Thesis language: |
English |
Abstract: |
This thesis evaluates the effects of outliers on the performance of hierarchical clustering algorithms applied to qualitative data. Specifically, the study examines how the presence of outliers distorts clustering outcomes and whether the application of appropriate outlier detection techniques enhances cluster validity and structural stability. Hierarchical clustering, while widely used for exploratory analysis, is particularly sensitive to outliers, especially for qualitative data, where non-metric similarity measures and linkage methods are applied. This thesis investigates the influence of both individual and group outliers using synthetically generated categorical datasets and applies three detection techniques: Clustering-Based Outlier Detection, Local Outlier Factor, and Frequent Pattern Outlier Factor. Clustering performance is evaluated before and after outlier removal across combinations of similarity measures (SMC, Lin, Eskin) and linkage methods (single, complete, average). Internal (Silhouette Index, Dunn Index) and external (Adjusted Rand Index, Purity) validation metrics are employed to assess clustering quality. Results show that outliers negatively impact cluster quality, while targeted detection methods improve clustering robustness. The findings offer methodological guidance for analysts dealing with categorical data in domains such as market segmentation, social science, and health informatics. |
Keywords: |
Hierarchical Cluster Analysis; Outliers in Clustering; Qualitative Data; Sensitivity to Outliers |
Information about study
Study programme: |
Economic Data Analysis/Data Analysis and Modeling |
Type of study programme: |
Master's study programme |
Assigned degree: |
Ing. |
Institutions assigning academic degree: |
Vysoká škola ekonomická v Praze |
Faculty: |
Faculty of Informatics and Statistics |
Department: |
Department of Statistics and Probability |
Information on submission and defense
Date of assignment: |
31. 10. 2023 |
Date of submission: |
26. 6. 2025 |
Date of defense: |
2025 |