Effect of outliers on hierarchical cluster analysis of qualitative data

Title: Effect of outliers on hierarchical cluster analysis of qualitative data
Author: Çırak, Elif
Thesis type: Diploma thesis
Supervisor: Cibulková, Jana
Opponent: Šulc, Zdeněk
Language: English
Abstract:
This thesis evaluates the effects of outliers on the performance of hierarchical clustering algorithms applied to qualitative data. Specifically, it examines how the presence of outliers distorts clustering outcomes and whether applying appropriate outlier detection techniques improves cluster validity and structural stability. Hierarchical clustering, while widely used for exploratory analysis, is particularly sensitive to outliers, especially for qualitative data, where non-metric similarity measures are combined with linkage methods. The thesis investigates the influence of both individual and group outliers using synthetically generated categorical datasets and applies three detection techniques: Clustering-Based Outlier Detection, Local Outlier Factor, and Frequent Pattern Outlier Factor. Clustering performance is evaluated before and after outlier removal across combinations of similarity measures (SMC, Lin, Eskin) and linkage methods (single, complete, average). Internal (Silhouette Index, Dunn Index) and external (Adjusted Rand Index, Purity) validation metrics are used to assess clustering quality. The results show that outliers reduce cluster quality, while targeted detection methods improve clustering robustness. The findings offer methodological guidance for analysts working with categorical data in domains such as market segmentation, social science, and health informatics.
Keywords: Hierarchical Cluster Analysis; Outliers in Clustering; Qualitative Data; Sensitivity to Outliers
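
The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It is not the author's code or data: it assumes a dissimilarity of one minus the simple matching coefficient (SMC), average linkage only, Purity as the single validation metric, and a hypothetical toy dataset with one individual outlier. The function names are illustrative, and the outlier is excluded from the Purity calculation because it has no true class.

```python
from collections import Counter
from itertools import combinations


def smc_dissimilarity(a, b):
    """1 - simple matching coefficient: share of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(rows, k):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(rows))]

    def dist(c1, c2):
        # Average linkage: mean pairwise dissimilarity between members.
        total = sum(smc_dissimilarity(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # combinations() yields p < q, so deleting q never shifts p.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: dist(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(rows)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def purity(true_labels, predicted):
    """Share of objects that fall in the majority true class of their cluster."""
    by_cluster = {}
    for t, p in zip(true_labels, predicted):
        by_cluster.setdefault(p, []).append(t)
    hits = sum(Counter(ts).most_common(1)[0][1] for ts in by_cluster.values())
    return hits / len(true_labels)


# Hypothetical toy data: two groups of categorical records plus one outlier.
group_a = [("red", "small", "yes", "x")] * 3 + [("red", "small", "no", "x")]
group_b = [("blue", "large", "no", "y")] * 3 + [("blue", "large", "yes", "y")]
outlier = [("green", "medium", "maybe", "z")]
truth = [0] * 4 + [1] * 4

with_outlier = average_linkage(group_a + group_b + outlier, k=2)
without_outlier = average_linkage(group_a + group_b, k=2)

# Score only the eight regular records; the outlier has no true class.
purity_with = purity(truth, with_outlier[:8])
purity_without = purity(truth, without_outlier)
print(purity_with, purity_without)  # prints 0.5 1.0
```

In this toy case the outlier is equally far from both groups, so with two clusters requested it keeps a cluster to itself and the two real groups are forced together, which halves Purity. Removing it restores the two-group solution. Other similarity measures or linkage methods may behave differently.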

Study information

Study programme / field: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's study programme
Degree awarded: Ing.
Degree-granting institution: Prague University of Economics and Business (Vysoká škola ekonomická v Praze)
Faculty: Faculty of Informatics and Statistics
Department: Department of Statistics and Probability

Submission and defense information

Date of assignment: 31 October 2023
Date of submission: 26 June 2025
Date of defense: 2025

Files for download

The files will be available only after the thesis defense.
