IMPROVING THE QUALITY OF INFORMATION ON PRICES OF DWELLINGS ADVERTISED IN AGGREGATOR PLATFORMS

Title: Improving the Quality of Information on Prices of Dwellings Advertised in Aggregator Platforms
Author: Pasternak, Dempsey
Thesis type: Diploma thesis
Supervisor: Musil, Petr
Opponents: -
Language: English
Abstract:
This thesis aims to improve data quality in real-estate aggregator datasets by identifying and consolidating duplicate listings that refer to the same underlying dwelling. Using a cross-section of 2,607 rental-flat advertisements mined from Sreality.cz in April 2026, the study develops a reproducible entity-resolution procedure that combines rule-based data preparation and street-based candidate-pair construction with image-supported manual review, Random Forest classification, and graph-based entity-resolution methods, including correlation clustering. The procedure is designed to distinguish repeated advertisements, arising for example from parallel agency advertising, reposting, or listing revisions, from listings that merely share similar observable characteristics. Pairwise duplicate predictions are then transformed into conservative, balanced, and expansive entity-level datasets, so that the effect of alternative resolution assumptions can be assessed. On the manually reviewed and graph-adjudicated validation data, the second-stage Random Forest model provides a stronger basis for duplicate detection than the preliminary model. Duplicate resolution reduces the observed advertised rental stock from 2,607 listings to 2,409 entities under the conservative regime, 2,377 under the balanced correlation-clustering regime, and 2,371 under the expansive regime. Its effect on municipal advertised nominal rent per square meter is comparatively modest, although changes in stock and rent are more uneven across Prague districts. The resulting entity-level datasets provide a more defensible basis for analyzing advertised rental supply and price conditions, demonstrating that duplicate identification is a necessary methodological step before online listings are used to represent Prague’s rental market.
Keywords: random forest; correlation clustering; entity resolution; deduplication; online housing listings

Study information

Study programme / field: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's study programme
Degree awarded: Ing.
Degree-granting institution: Vysoká škola ekonomická v Praze
Faculty: Fakulta informatiky a statistiky
Department: Katedra ekonomické statistiky

Submission and defence information

Date of thesis assignment: 20 October 2025
Date of thesis submission: 25 June 2026
Date of defence: 2026

Files for download

The files will be available only after the thesis defence.
