Design and Evaluation of a Model for Continuous Monitoring of Data Reliability
Author:
Erbaşı Koşalay, Ayça
Thesis type:
Diploma thesis
Supervisor:
Kučera, Jan
Reviewers:
Karkošková, Soňa
Language:
English
Abstract:
This thesis designs, implements, and evaluates a continuous monitoring model for data reliability in a cloud-native Snowflake data platform, using Datadog for monitoring and dashboards and Slack and PagerDuty for alert routing. Organisations increasingly depend on analytical data products, yet data that is late, incomplete, or rule-breaking undermines decisions and creates operational and financial risk; despite this, reliability monitoring in modern data platforms often remains ad hoc and reactive. Following a Design Science Research (DSR) methodology structured around relevance, design, and rigour cycles, the research derives eight design requirements from the problem context and the literature and produces a reusable artefact: a compact catalogue of Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the freshness, completeness, validity, volume, and schema dimensions; a four-tier severity taxonomy (P1–P4); a pipeline ownership registry; an escalation and routing matrix; and runbook templates. The model is implemented with Snowflake Tasks and Datadog monitors and evaluated using a synthetic dataset and four demonstration scenarios, complemented by an expert interview. The evaluation shows that the SRE SLI/SLO abstraction transfers directly to data pipeline monitoring, that detection latency is bounded by the monitor evaluation window, and that automated severity-based routing reduces manual triage. The principal contribution is the domain transfer of Site Reliability Engineering principles to data pipeline observability, delivered as a deployable and transferable operating model.
Keywords:
data quality; Snowflake; Datadog; Slack; PagerDuty; service levels; data reliability; SLI and SLO; observability; incident management; MTTD; MTTR; runbooks; design science research