Effect of outliers on hierarchical cluster analysis of qualitative data
Author of thesis:
Çırak, Elif
Type of thesis:
Diploma thesis
Supervisor:
Cibulková, Jana
Opponents:
Šulc, Zdeněk
Thesis language:
English
Abstract:
This thesis evaluates how outliers affect the performance of hierarchical clustering algorithms applied to qualitative data. Specifically, it examines how the presence of outliers distorts clustering outcomes and whether applying appropriate outlier detection techniques improves cluster validity and structural stability. Hierarchical clustering, while widely used for exploratory analysis, is particularly sensitive to outliers, especially for qualitative data, where non-metric similarity measures are combined with linkage methods. The thesis investigates the influence of both individual and group outliers using synthetically generated categorical datasets and applies three detection techniques: Clustering-Based Outlier Detection, Local Outlier Factor, and Frequent Pattern Outlier Factor. Clustering performance is evaluated before and after outlier removal across combinations of similarity measures (SMC, Lin, Eskin) and linkage methods (single, complete, average). Internal (Silhouette Index, Dunn Index) and external (Adjusted Rand Index, Purity) validation metrics are used to assess clustering quality. The results show that outliers degrade cluster quality, while targeted detection methods improve clustering robustness. The findings offer methodological guidance for analysts working with categorical data in domains such as market segmentation, social science, and health informatics.
Keywords:
Hierarchical Cluster Analysis; Outliers in Clustering; Qualitative Data; Sensitivity to Outliers