Effect of outliers on hierarchical cluster analysis of qualitative data

Thesis title: Effect of outliers on hierarchical cluster analysis of qualitative data
Author: Çırak, Elif
Thesis type: Diploma thesis
Supervisor: Cibulková, Jana
Opponents: Šulc, Zdeněk
Thesis language: English
Abstract:
This thesis evaluates the effects of outliers on the performance of hierarchical clustering algorithms applied to qualitative data. Specifically, the study examines how the presence of outliers distorts clustering outcomes and whether applying appropriate outlier detection techniques improves cluster validity and structural stability. Hierarchical clustering, while widely used for exploratory analysis, is particularly sensitive to outliers, especially for qualitative data, where non-metric similarity measures are combined with linkage methods. The thesis investigates the influence of both individual and group outliers using synthetically generated categorical datasets and applies three detection techniques: Clustering-Based Outlier Detection, Local Outlier Factor, and Frequent Pattern Outlier Factor. Clustering performance is evaluated before and after outlier removal across combinations of similarity measures (SMC, Lin, Eskin) and linkage methods (single, complete, average). Internal (Silhouette Index, Dunn Index) and external (Adjusted Rand Index, Purity) validation metrics are used to assess clustering quality. Results show that outliers negatively impact cluster quality, while targeted detection methods improve clustering robustness. The findings offer methodological guidance for analysts working with categorical data in domains such as market segmentation, social science, and health informatics.
Keywords: Hierarchical Cluster Analysis; Outliers in Clustering; Qualitative Data; Sensitivity to Outliers
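
Illustrative sketch (not taken from the thesis): the abstract describes clustering categorical data with a similarity measure such as SMC and a linkage method, then comparing results with and without outliers. The minimal Python example below shows that idea on toy data invented for this purpose. The function names `smc_dissimilarity` and `agglomerate` are hypothetical, the agglomerative procedure is a naive textbook version, and the thesis's own implementation, datasets, and detection methods may differ.

```python
from itertools import combinations

def smc_dissimilarity(a, b):
    """1 - simple matching coefficient: share of attributes on which a and b differ."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

# Linkage rules: how pairwise dissimilarities between two clusters are combined.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def agglomerate(rows, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns clusters as sorted lists of row indices."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage dissimilarity.
        for p, q in combinations(range(len(clusters)), 2):
            ds = [smc_dissimilarity(rows[i], rows[j])
                  for i in clusters[p] for j in clusters[q]]
            d = combine(ds)
            if best is None or d < best[0]:
                best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Toy categorical data (assumption, invented for illustration):
# two groups of three objects plus one individual outlier.
rows = [
    ("red", "small", "round", "yes"),
    ("red", "small", "oval", "yes"),
    ("red", "small", "round", "yes"),
    ("blue", "large", "square", "yes"),
    ("blue", "large", "flat", "yes"),
    ("blue", "large", "square", "yes"),
    ("green", "medium", "star", "no"),  # outlier: matches no other row on any attribute
]

with_outlier = agglomerate(rows, 2, "average")
without_outlier = agglomerate(rows[:-1], 2, "average")
print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

In this toy case, a two-cluster solution with the outlier present isolates the outlier in its own cluster and merges the two genuine groups, whereas removing the outlier first recovers both groups. This is only one small configuration and says nothing about the magnitude of the effects reported in the thesis.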

Information about study

Study programme: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's study programme
Assigned degree: Ing.
Institutions assigning academic degree: Vysoká škola ekonomická v Praze
Faculty: Faculty of Informatics and Statistics
Department: Department of Statistics and Probability

Information on submission and defense

Date of assignment: 31. 10. 2023
Date of submission: 26. 6. 2025
Date of defense: 2025

Files for download

The files will be available after the defense of the thesis.
