Effect of outliers on hierarchical cluster analysis of qualitative data
Title: | Effect of outliers on hierarchical cluster analysis of qualitative data |
---|---|
Author: | Çırak, Elif |
Thesis type: | Diploma thesis |
Supervisor: | Cibulková, Jana |
Opponents: | Šulc, Zdeněk |
Language: | English |
Abstract: | This thesis evaluates the effects of outliers on the performance of hierarchical clustering algorithms applied to qualitative data. Specifically, it examines how the presence of outliers distorts clustering outcomes and whether applying appropriate outlier detection techniques improves cluster validity and structural stability. Hierarchical clustering, while widely used for exploratory analysis, is particularly sensitive to outliers, especially for qualitative data, where non-metric similarity measures are combined with linkage methods. The thesis investigates the influence of both individual and group outliers using synthetically generated categorical datasets and applies three detection techniques: Clustering-Based Outlier Detection, Local Outlier Factor, and Frequent Pattern Outlier Factor. Clustering performance is evaluated before and after outlier removal across combinations of similarity measures (SMC, Lin, Eskin) and linkage methods (single, complete, average). Internal (Silhouette Index, Dunn Index) and external (Adjusted Rand Index, Purity) validation metrics are used to assess clustering quality. Results show that outliers negatively impact cluster quality, while targeted detection methods improve clustering robustness. The findings offer methodological guidance for analysts dealing with categorical data in domains such as market segmentation, social science, and health informatics. |
Keywords: | Hierarchical Cluster Analysis; Outliers in Clustering; Qualitative Data; Sensitivity to Outliers |
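
The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy data, the function names (`smc_dissimilarity`, `agglomerate`, `purity`), and the choice of two clusters are assumptions made here for illustration, and only the SMC measure and the Purity metric are shown. The sketch clusters six categorical objects from two known classes, once without and once with a single individual outlier, and compares Purity on the six original objects.

```python
from itertools import combinations


def smc_dissimilarity(a, b):
    """1 - simple matching coefficient: share of variables on which a and b differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Linkage rules: how to summarise all pairwise distances between two clusters.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(rows, k, linkage="average"):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        def dist(pair):
            p, q = pair
            return combine([smc_dissimilarity(rows[i], rows[j])
                            for i in clusters[p] for j in clusters[q]])
        p, q = min(combinations(range(len(clusters)), 2), key=dist)
        clusters[p] += clusters[q]  # p < q, so deleting q keeps p valid
        del clusters[q]
    labels = [0] * len(rows)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def purity(labels, truth):
    """Fraction of objects that belong to the majority true class of their cluster."""
    total = 0
    for c in set(labels):
        classes = [t for l, t in zip(labels, truth) if l == c]
        total += max(classes.count(t) for t in set(classes))
    return total / len(labels)


# Hypothetical toy data: two classes of three objects, five categorical variables.
clean = [
    ("a", "x", "m", "s", "u"), ("a", "x", "m", "s", "v"), ("a", "x", "n", "s", "u"),
    ("b", "y", "p", "s", "w"), ("b", "y", "p", "s", "z"), ("b", "y", "q", "s", "w"),
]
truth = [0, 0, 0, 1, 1, 1]
outlier = ("c", "z", "r", "t", "k")  # differs from every object on every variable

labels_clean = agglomerate(clean, k=2)
labels_dirty = agglomerate(clean + [outlier], k=2)[:6]  # score the original objects only

print(purity(labels_clean, truth))  # 1.0
print(purity(labels_dirty, truth))  # 0.5
```

In this toy case the outlier occupies one of the two clusters by itself, which forces the two true classes to merge and halves the Purity; removing it restores the correct partition. Real categorical data would rarely behave this cleanly.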
Study information
Study programme / field: | Economic Data Analysis / Data Analysis and Modeling |
---|---|
Type of study programme: | Master's study programme |
Degree awarded: | Ing. |
Degree-granting institution: | Vysoká škola ekonomická v Praze |
Faculty: | Fakulta informatiky a statistiky |
Department: | Katedra statistiky a pravděpodobnosti |
Submission and defense information
Date of assignment: | 31 October 2023 |
---|---|
Date of submission: | 26 June 2025 |
Date of defense: | 20 August 2025 |
Identifier in the InSIS system: | https://insis.vse.cz/zp/89891/podrobnosti |