This thesis investigates the influence of outliers on the results of cluster analysis and explores methods for their detection. In the current era of big data, clustering is an essential tool for extracting information, but the presence of outliers can lead to misleading results. This research quantifies the impact of outliers on the quality of clusters and the accuracy of classification, considering quantitative and qualitative data separately. Furthermore, the selected outlier detection methods are evaluated based on their precision. The motivation for this thesis comes from the lack of studies in this area, particularly concerning the accuracy of classification using clustering. The experimental analysis is conducted using the well-known Iris dataset, into which various types of outliers are synthetically inserted. Additionally, binning is applied to obtain categorical variables. The results demonstrate how the presence of outliers influences the quality of clusters and the precision of classification for both data types, and compare the effectiveness of various outlier detection methods. This thesis contributes to a better understanding of the impact of outliers on cluster analysis and provides guidance for selecting appropriate outlier identification methods based on data and outlier types.