Building resilience: Observability for modern data teams

September 24, 2025
Data observability has emerged as a distinct discipline that goes beyond traditional data quality monitoring. While data quality focuses on the accuracy and completeness of data itself, observability encompasses the broader visibility into data systems—understanding not just what happened, but when, why, and how it impacts downstream processes.
The need for observability becomes particularly acute during periods of data downtime, when data is partial, erroneous, missing, or inaccurate. These incidents can cascade through an organization, affecting everything from executive dashboards to automated marketing campaigns. The cost of poor data quality extends beyond immediate operational impacts—it erodes trust in data systems and can lead to hesitation in making data-driven decisions.
Modern data stacks, while powerful and flexible, have actually increased the surface area for potential issues. The proliferation of data sources, transformation tools, and consumption endpoints creates more points of failure and makes it harder to maintain end-to-end visibility. This fragmentation makes observability not just helpful, but essential for maintaining reliable data operations at scale.
Learning from SurveyMonkey's approach
SurveyMonkey's data engineering team provides a compelling example of how organizations can build resilience through observability. When the team conducted an internal survey to understand data challenges across the organization, they discovered that 53% of respondents cited data quality issues as a primary concern. More surprisingly, 50% identified data processing issues—problems with pipeline performance and reliability—as equally significant challenges.
This discovery led to a systematic approach that combined two complementary strategies: reactive monitoring through Monte Carlo and proactive data management through dbt. Rather than treating these as separate initiatives, SurveyMonkey integrated them to create a comprehensive observability framework that addressed both data quality and processing performance.
The company's data landscape provides context for the scale of this challenge. SurveyMonkey processes data from nearly 900 sources, maintains 600 dbt models with 2,000 test cases, and runs over 180 workflows daily. This complexity requires sophisticated monitoring and alerting capabilities that can operate reliably regardless of pipeline success or failure.
Integrating reactive monitoring and proactive testing
The integration of Monte Carlo and dbt at SurveyMonkey demonstrates how reactive monitoring and proactive testing can reinforce each other. dbt serves as the foundation for data transformation, providing models that clean raw data from various sources and create high-quality, usable datasets. The dbt testing framework verifies data quality through automated checks, while dbt documentation creates consistent, well-documented models that serve as single sources of truth.
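To make that concrete, here is a minimal sketch of a reusable dbt generic test; the test name and the column it guards are hypothetical, not taken from SurveyMonkey's project:

```sql
-- tests/generic/non_negative.sql -- a hypothetical generic test.
-- dbt treats any rows a test query returns as failures, so this test
-- fails whenever the guarded column contains a negative value.
{% test non_negative(model, column_name) %}

select {{ column_name }}
from {{ model }}
where {{ column_name }} < 0

{% endtest %}
```

Once defined, a test like this can be attached to any column in a model's schema.yml, which is how a single check becomes a standard applied across hundreds of models.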
Monte Carlo complements this foundation by monitoring dbt models and pipelines in production, detecting anomalies and ensuring ongoing data accuracy. The real power emerges from how these tools work together rather than in isolation.
One of the most effective practices SurveyMonkey implemented was converting Monte Carlo anomalies into dbt test cases. When the monitoring system detected an anomaly that indicated a serious data quality issue, the team would create a corresponding dbt test that would prevent the pipeline from proceeding if the same condition occurred again. This approach shifts the responsibility for data quality upstream, enabling business users to address issues at their source rather than waiting for data engineering intervention.
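As an illustration of the pattern, suppose the monitoring layer had flagged a spike in null IDs in a staging table. A singular dbt test encoding that condition might look like the following (the model and column names are hypothetical); run under dbt build, a failure blocks the models downstream of the tested table from executing:

```sql
-- tests/assert_no_null_respondent_ids.sql -- hypothetical singular test
-- created from a detected anomaly. Any returned rows fail the test,
-- and `dbt build` then skips models downstream of the staging table.
select
    response_id,
    loaded_at
from {{ ref('stg_survey_responses') }}
where respondent_id is null
```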
This integration also standardized quality checks across all models. Every dbt model became subject to mandatory testing, creating a consistent baseline for data quality expectations. Regular performance reviews ensured that models didn't degrade over time, while automated monitoring and alerting provided proactive notifications when issues arose.
Measuring the impact of observability
The results of SurveyMonkey's observability implementation demonstrate the tangible business value of comprehensive monitoring and testing. The most striking outcome was a 73% reduction in Snowflake credit usage across nearly 10,000 credit-consuming jobs. This dramatic improvement came primarily from performance monitoring that identified inefficient queries, unused models, and optimization opportunities.
Performance monitoring revealed long-running jobs performing unnecessary cross-joins, enabling the team to simplify SQL statements and merge redundant queries. The systematic identification and removal of unused models and tables further contributed to cost reduction. In one particularly impressive case, obsoleting unused models and revamping inefficient queries resulted in a 94% reduction in pipeline runtime and a 97% reduction in Snowflake credit usage.
Beyond cost savings, the observability framework enabled SurveyMonkey to scale while keeping costs stable. Despite adding many more dbt models—including bringing a marketing analytics platform in-house—job execution times remained stable while credit consumption steadily decreased. This demonstrates how effective observability can support growth without proportional increases in operational overhead.
Data quality improvements were equally significant. When SurveyMonkey integrated the third-party marketing analytics platform, Monte Carlo initially detected a spike in data anomalies as the team learned to work with the new data sources. However, the systematic conversion of anomalies into dbt test cases led to a corresponding decrease in anomalies over time, creating a self-improving system that became more robust with experience.
Building observability with dbt artifacts
While third-party observability tools provide valuable capabilities, teams can also build significant observability using dbt's native artifacts and metadata. dbt generates detailed artifacts after every run, test, or build command, containing granular information about model execution, test results, and pipeline performance.
These artifacts serve as a rich data source for custom observability solutions. The project manifest (manifest.json) provides complete configuration information for the dbt project, while the run results artifact (run_results.json) contains detailed execution data for models, tests, and other resources. When combined with data warehouse query history, these artifacts enable deep insights into model-level performance that can inform optimization decisions.
Teams have successfully built lightweight ELT systems that ingest artifact data into their data warehouses, then use dbt itself to transform this metadata into structured models that power dashboards and alerting systems. This approach leverages existing infrastructure and skills while providing customizable observability tailored to specific organizational needs.
The key components of such a system include orchestration that reliably captures artifacts regardless of pipeline success or failure, storage that preserves historical artifact data for trend analysis, modeling that transforms raw artifacts into actionable insights, and alerting that notifies relevant stakeholders when issues arise.
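A minimal sketch of the modeling step, assuming the raw run_results.json artifacts have already been landed in a Snowflake table with a single variant column (the raw.dbt_artifacts.run_results source and its artifact column are illustrative assumptions):

```sql
-- models/observability/stg_dbt_run_results.sql -- flattens raw
-- run_results.json artifacts into one row per executed node.
select
    artifact:metadata:invocation_id::string    as invocation_id,
    artifact:metadata:generated_at::timestamp  as generated_at,
    node.value:unique_id::string               as node_id,
    node.value:status::string                  as status,
    node.value:execution_time::float           as execution_time_seconds
from raw.dbt_artifacts.run_results,
    lateral flatten(input => artifact:results) as node
```

A time series of execution_time_seconds per node_id, built on top of a model like this, is enough to power the trend dashboards and alerting described above.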
Implementing effective alerting strategies
Effective alerting represents one of the most critical aspects of data observability, yet it's often implemented poorly. The goal is to provide timely, actionable notifications to the right people without creating alert fatigue or overwhelming teams with false positives.
SurveyMonkey's approach demonstrates several best practices for data alerting. First, they implemented domain-specific tagging that allows alerts to be routed to appropriate team members based on model ownership. Every model in their dbt deployment includes domain tags like "growth," "finance," or "catalog," which correspond to Slack user groups containing relevant stakeholders.
This targeted alerting ensures that model owners receive notifications about their specific models rather than broadcasting alerts to entire teams. The alerts include sufficient context for debugging, including error messages, model names, and timestamps, enabling recipients to quickly understand and address issues.
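In dbt, that routing metadata can live directly in the model's configuration. Here is a minimal sketch with a hypothetical finance model and Slack group, assuming the alerting layer reads tags and meta from the manifest artifact:

```sql
-- models/finance/fct_revenue.sql -- hypothetical model showing domain
-- tagging. An alerting job can read the tag and owner from the manifest
-- and route failures to the matching Slack user group.
{{ config(
    materialized = 'table',
    tags = ['finance'],
    meta = {'owner_slack_group': '@finance-data-owners'}
) }}

select
    order_id,
    order_date,
    amount_usd
from {{ ref('stg_orders') }}
```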
Importantly, the team learned not to roll out anomaly notifications to business users at the start of a new data integration. When the data engineers themselves don't yet fully understand a new data source, alerting business users about its anomalies is counterproductive. Letting pipelines stabilize and learning the data's normal patterns before involving business users prevents unnecessary noise and maintains alert credibility.
Performance optimization through observability
Beyond alerting and quality monitoring, observability data provides valuable insights for performance optimization. By combining dbt artifacts with data warehouse query history, teams can identify models that are candidates for different materialization strategies, clustering improvements, or warehouse sizing adjustments.
Performance dashboards can surface models with high execution times, excessive data spillage, or inefficient partition scanning patterns. Time series views of individual models help identify performance degradation over time, while pipeline-level visualizations reveal bottlenecks that affect overall execution times.
These insights enable data teams to make informed decisions about optimization priorities. Rather than guessing which models might benefit from incremental materialization or increased warehouse sizes, teams can use concrete performance data to guide their efforts and measure the impact of changes.
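As one concrete example, a query along these lines joins the artifact model sketched earlier to Snowflake's account usage query history to rank models by average runtime; it assumes each model sets its Snowflake query_tag to its dbt node id, which is a project convention rather than a default:

```sql
-- Hypothetical analysis query: average runtime and data scanned per
-- dbt model over the past week, slowest models first.
select
    r.node_id,
    avg(r.execution_time_seconds)        as avg_runtime_seconds,
    avg(q.bytes_scanned) / pow(1024, 3)  as avg_gb_scanned
from analytics.observability.stg_dbt_run_results as r
join snowflake.account_usage.query_history as q
    on q.query_tag = r.node_id  -- assumes query_tag is set per model
where r.generated_at >= dateadd('day', -7, current_timestamp())
group by r.node_id
order by avg_runtime_seconds desc
limit 20;
```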
Lessons learned and best practices
The experiences of teams implementing data observability reveal several important lessons. First, the combination of proactive testing and reactive monitoring creates more resilient systems than either approach alone. dbt tests catch many issues before they reach production, while observability tools detect the problems that slip through.
Second, stopping pipelines on bad data gets everyone's attention and creates accountability for data quality. When important anomalies are detected, creating test cases that prevent pipeline execution until upstream issues are resolved shifts responsibility appropriately and prevents the propagation of known data quality problems.
Third, onboarding new users to observability systems from the beginning ensures adoption and effectiveness. Training business users to leverage observability tools for incident routing and data discovery creates a self-service culture that reduces the burden on data engineering teams.
Finally, the tools and approaches used for observability should integrate well with existing workflows and infrastructure. Solutions that require significant additional overhead or specialized expertise are less likely to be maintained effectively over time.
The future of data observability
As data systems continue to evolve, observability practices must adapt to new challenges and opportunities. The emergence of data mesh architectures and domain-driven data ownership creates new requirements for observability that spans organizational boundaries while maintaining appropriate access controls and governance.
Artificial intelligence and machine learning are beginning to enhance observability capabilities, from automated anomaly detection to intelligent alerting that reduces false positives. However, the fundamental principles of comprehensive monitoring, proactive testing, and effective alerting remain constant.
The most successful data teams will be those that treat observability as a core competency rather than an afterthought. This means investing in the tools, processes, and skills necessary to maintain visibility into data systems at scale, while fostering a culture that values data reliability and quality as essential business capabilities.
Building resilience through observability requires commitment and investment, but the returns—in terms of cost savings, improved reliability, and increased trust in data—justify the effort. As data becomes increasingly central to business operations, the organizations that master data observability will have a significant competitive advantage in their ability to make reliable, data-driven decisions at scale.