
A guide to DataOps with dbt

Deploying new data projects at speed and scale requires a different approach than data engineering teams have used in the past. The scale of data and the speed of business require a process that enables teams to ship quickly and with high quality.

DataOps is an approach to data development that aims to break down the barriers between data producers and data consumers that slow down projects. Successfully embracing a DataOps approach requires not just understanding the methodology but adopting tools that help automate the process of designing, developing, testing, and deploying data into production.

In this article, we’ll look at the methodology behind DataOps, as well as how dbt as a toolset helps teams successfully implement that methodology and embrace a DataOps culture.

A DataOps primer

DataOps is patterned after its predecessor, DevOps. DevOps was created in the world of software engineering to eliminate silos between software development and IT operations teams to create a smoother, streamlined, and unified process for deploying software. In a similar vein, DataOps removes silos between data producers and data consumers to make data more available, reliable, valuable, and accessible.

DevOps addressed issues in legacy “waterfall” development cycles, where developers would hand software off to IT, which would then struggle to deploy and maintain it. This resulted in slow software development cycles, buggy software, and ultimately, unhappy customers. DevOps encouraged developers and system administrators to eliminate this friction, working closely together to create a unified process and toolset that enabled them to develop, test, deploy, and maintain applications smoothly.

DataOps collapses the barriers between data producers and data consumers, who can get trapped in similar cycles where producers ship data products that don’t meet data consumers’ needs. DataOps emphasizes the need for data pipeline automation to enhance the efficiency and reliability of data transformations. In DataOps, the data team owns the full data lifecycle - design, dev, deployment, and observability - empowering them to work closely with stakeholders on delivering high-quality, relevant data with velocity.

Methodology: The core stages of DataOps

So how does DataOps work? We can think of it as encompassing five phases:

  • Plan
  • Build
  • Deploy
  • Monitor
  • Catalog

As with DevOps, these phases shouldn’t be thought of as long, drawn-out projects that take months from conception to completion. Rather, teams often work within an agile development framework, organizing work into short iterations (“sprints”), at the end of which they deliver something of value to stakeholders. The process repeats over and over, with each new iteration adding functionality and fixes in response to stakeholder feedback.

Plan

In the planning phase, the data team works with stakeholders to understand what data they need, the format in which they need it, how quickly they need it updated, where the data will be sourced from, etc. It’s at this stage that both teams also set various Key Performance Indicators (KPIs) and Service Level Agreements (SLAs) around things such as data quality, data freshness, query performance, etc.

Build

In the build phase, the data team creates the data sets they’ll deliver to stakeholders. This involves building:

  • Data models
  • Data transformations
  • Tests to gauge and certify quality
  • Documentation describing the data, its business purpose, and how it’s calculated

Making tests and documentation a part of the build phase is a hallmark of both DevOps and DataOps. Tests ensure that data meets business requirements as well as KPIs and SLAs around quality and performance. Documentation ensures that stakeholders know exactly what business purpose a model serves, where its data comes from, and how it’s calculated. This makes data quality an integral part of the development process rather than an afterthought.
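
To make this concrete, here’s a minimal sketch of what a single transformation might look like as a dbt model; the model, source, and column names are hypothetical.

```sql
-- models/fct_daily_revenue.sql (hypothetical model)
-- Aggregates completed orders into a daily revenue table that stakeholders
-- can query directly or feed into downstream reports.
select
    order_date,
    count(*)        as completed_orders,
    sum(amount_usd) as daily_revenue
from {{ ref('stg_orders') }}  -- hypothetical upstream staging model
where status = 'completed'
group by order_date
```

The tests and documentation for a model like this live alongside it in the same project and ship in the same change, which is what keeps quality checks from becoming an afterthought (concrete examples appear in the dbt sections below).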

Deploy

In the deploy phase, the data team pushes its data set changes out of local development and through a series of environments, testing the changes at each stage to ensure they behave as expected. The goal is to ensure that every change is rigorously tested and validated before it’s pushed to production and made available to data consumers.

For example, a team may release a change to a Staging environment, where an automated process runs the data team’s tests on sample data and collects metrics on data quality and query performance. If all tests and metrics checks pass, the process may then push to Production, where the team will run the same set of tests on real-world data.

The combination of automation and version control means that teams can both deploy and roll back changes as needed. If a team identifies issues, say, with their Version 2 release in production, they can roll the deployment back to Version 1 easily. This enables stakeholders to continue using the system while the engineering team addresses critical issues.
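
When changes are tracked in Git, that rollback can be as simple as reverting the offending commit and letting the deployment pipeline rebuild from the restored state; a sketch, with the commit reference as a placeholder:

```shell
# Revert the merge commit that introduced the Version 2 changes
git revert -m 1 <merge-commit-sha>
# Push the revert; the CI/CD pipeline redeploys the previous (Version 1) state
git push origin main
```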

Monitor

Once data is deployed into production, data consumers can access it on a self-service basis, using it in their own reports and data products. During this time, the data team will continue to observe metrics, logs, and traces as data updates flow in, responding to any identified data anomalies or performance issues.

Catalog

All of these workflows create extensive metadata that gives stakeholders insight into how data products are built, used, optimized, and debugged. This metadata can be visualized and explored in a catalog, where the data team ensures that its work is discoverable and documented. This might entail:

  • Making data sets available for discovery in a data catalog
  • Generating data lineage charts to document a data set’s sources, dependencies, owners, and relationships to other data sets
  • Publishing data models and documentation so other teams can consume their work

Cataloging helps reduce data silos and redundant data transformation work. Before embarking on a new data project, a team can search the catalog to discover whether another team has already created a high-quality data set they can build on.

Tooling: How dbt supports the five stages of DataOps

At dbt Labs, we’ve long believed in and supported DataOps as a methodology. That’s why dbt supports DataOps through every step of the process, making it easier to run your business on trusted data.

Plan: dbt Mesh

dbt Cloud offers several built-in features that help users plan and align on their data projects:

  • dbt Mesh: dbt Mesh enables companies to implement a data mesh architecture - a decentralized, scalable approach to data management. With dbt Mesh, data teams can design their data workflow architecture to support the unique needs of their downstream stakeholders with domain-specific data, and do so in a governed, automated, simplified way. This enables them to work independently of other teams without sacrificing collaboration, governance, or security (see the first sketch after this list).

  • SLAs and data freshness: dbt also provides built-in source freshness checks, which help users determine whether source data is meeting pre-defined SLAs (see the second sketch after this list).
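
As a first sketch of the dbt Mesh pattern, an upstream team can mark a model as public - and optionally enforce a contract on it - so that other teams’ projects can depend on it safely. The project, model, and file names here are hypothetical:

```yaml
# models/core/_core__models.yml (hypothetical file in the upstream project)
models:
  - name: dim_customers
    access: public            # other projects in the mesh may reference this model
    config:
      contract:
        enforced: true        # lock in the columns and types consumers rely on
```

A downstream project can then reference the model with a cross-project ref such as {{ ref('core_platform', 'dim_customers') }}, where core_platform is the hypothetical upstream project name.

For source freshness, SLAs are declared on the sources themselves. A second sketch, again with hypothetical names:

```yaml
# models/staging/_sources.yml (hypothetical file)
sources:
  - name: ecommerce
    database: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at              # timestamp used to measure freshness
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}
```

Running dbt source freshness then reports whether each source is within its SLA.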

Build: Various development environments

dbt gained popularity as a tool that allowed developers to build analytics code using SQL. Given the varied skillsets and preferences of data collaborators within an organization, dbt supports a number of environments for authoring analytics code:

  • Cloud IDE: The dbt Cloud integrated development environment (IDE), a web-hosted environment with SQL syntax highlighting, auto-completion, code linting, documentation, and build/test/run controls for running and debugging work on demand.
  • CLI: Developers can work from the command line, using dbt’s CLI to write, run, and debug model changes (see the example commands after this list).
  • Visual editor: Soon, less SQL-savvy analysts will be able to create or edit dbt models through a visual, drag-and-drop experience inside of dbt Cloud. These models compile directly to SQL and are indistinguishable from other dbt models in your projects: they are version controlled, can be accessed across projects in a dbt Mesh, and integrate with dbt Explorer and the Cloud IDE. As part of this visual development experience, users can also take advantage of built-in AI for custom code generation where the need arises.
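
For the command-line workflow referenced above, a few representative commands (model names and selectors are illustrative):

```shell
dbt run --select stg_orders+     # build a model and everything downstream of it
dbt test --select stg_orders     # run the tests defined for that model
dbt build --select my_new_model  # run and test a single model in one step
```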

Deploy: Scheduling jobs, version control and CI/CD

dbt Cloud includes an in-app job scheduler to automate how and when you execute dbt jobs. To improve development feedback loops and optimize data platform consumption, users can also “defer to production” for any job run, meaning that when they want to run and test changes to a single model, dbt will build only that changed model (and defer any upstream dependencies to what’s already in prod).
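
In dbt Core CLI terms, a deferred run corresponds roughly to the following, where the artifacts path is a placeholder for wherever your production run artifacts are stored:

```shell
# Build only models that changed relative to production, resolving unchanged
# upstream refs against the production manifest instead of rebuilding them
dbt build --select state:modified+ --defer --state path/to/prod-artifacts
```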

A key part of implementing DataOps is tracking changes to your work. Without proper change tracking and control, a rogue alteration to source code that causes an error in production could take hours or more to hunt down and fix. dbt’s automated testing and version control streamline data pipeline automation, ensuring that teams spend less time on manual processes.

dbt supports version control via Git so that every change made to a model is committed, documented, and - if needed - reversible. Using Git, data team members keep their own versions of dbt models for development. When they’re ready to commit changes, they create a pull request (PR) that one or more other members of the team can review. This ensures that every change receives a second set of eyes before heading towards prod.
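
A typical flow for a single model change might look like this, with branch and file names purely illustrative:

```shell
git checkout -b feature/daily-revenue-model
git add models/fct_daily_revenue.sql models/_models.yml
git commit -m "Add daily revenue model with tests and docs"
git push origin feature/daily-revenue-model   # then open a pull request for review
```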

You can also configure dbt Cloud with Continuous Integration (CI) jobs to push changes from dev through prod. Once a PR is approved, it triggers a CI job, which runs and tests models in a staging (pre-production) environment before moving them to production.
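
dbt Cloud configures these CI jobs in-app, so no pipeline code is required. For teams running dbt Core with a generic CI runner instead, an equivalent job might look like this sketch; the adapter, profiles location, and artifacts path are all assumptions:

```yaml
# .github/workflows/dbt-ci.yml -- illustrative only; dbt Cloud's built-in CI
# provides this behavior without a hand-written workflow
name: dbt-ci
on:
  pull_request:
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake        # adapter choice is an assumption
      - run: dbt deps
      # prod-artifacts/ would be fetched from the latest production run (step omitted)
      - run: dbt build --select state:modified+ --defer --state prod-artifacts/
        env:
          DBT_PROFILES_DIR: ./ci              # CI connection profile (assumption)
```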

Once a set of changes is verified in staging, dbt Cloud can push them to production, a process known as Continuous Deployment (CD). This combined CI/CD process automates code integration, testing, and delivery of updates to production, identifying and eliminating potentially costly errors before they’re shipped to data consumers. The result is less manual deployment work and faster data development cycles. By incorporating CI/CD for data, dbt helps organizations maintain data quality and consistency across deployments.

Monitor: Automated testing

With dbt, data teams can proactively define assertions—called tests—about their data models. These tests can validate the behavior of model logic before the model is materialized in production (unit tests), or check any other assertion you want to make about your data (values are unique, non-null, etc.). If a test fails, the model won’t build—saving you from unnecessary data platform spend, while improving data product reliability.
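
Here’s a minimal sketch of both kinds of tests, defined in YAML against the hypothetical fct_daily_revenue model used earlier (the unit test syntax assumes a recent dbt version with unit test support):

```yaml
# models/_models.yml (hypothetical file)
models:
  - name: fct_daily_revenue
    columns:
      - name: order_date
        tests:
          - unique          # exactly one row per day
          - not_null
      - name: daily_revenue
        tests:
          - not_null

unit_tests:
  - name: revenue_excludes_cancelled_orders
    model: fct_daily_revenue
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, order_date: 2024-01-01, status: completed, amount_usd: 10}
          - {order_id: 2, order_date: 2024-01-01, status: cancelled, amount_usd: 99}
    expect:
      rows:
        - {order_date: 2024-01-01, completed_orders: 1, daily_revenue: 10}
```

Running dbt test executes these assertions, and dbt build runs them as part of building the DAG so that failures block downstream builds.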

In addition to setting up tests to proactively catch issues, it’s easy to monitor your production dbt jobs and alert the right people when something goes wrong, using Slack or email notifications, logs, run history dashboards, data health tiles, and more.

Catalog: dbt Explorer

Data developers and consumers alike benefit from having an understanding of data dependencies, freshness, use cases, and other relevant context. dbt Explorer is an interactive data catalog that represents all of the metadata created in every dbt Cloud run in an intuitive, visual interface. Using dbt Explorer, consumers can find a data asset and view it in context, complete with its metadata, documentation, and data lineage. Data producers can use dbt Explorer to find reusable data assets, as well as to trace lineage to troubleshoot data issues resulting from upstream data defects.

The nice thing is that, rather than being an afterthought in the DataOps process, the assets viewable in dbt Explorer - models, metadata, documentation, data lineage, security controls, etc. - are all generated automatically during the data development process itself, and updates are pushed to dbt Explorer with every push to production. With dbt, teams can implement collaborative data workflows that reduce bottlenecks and empower faster decision-making across departments.
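
For example, the descriptions that dbt Explorer surfaces for a model and its columns are written in the project itself as the model is developed; a minimal sketch with hypothetical names:

```yaml
models:
  - name: fct_daily_revenue
    description: "Daily revenue from completed orders, used by the finance dashboard."
    columns:
      - name: daily_revenue
        description: "Sum of amount_usd for orders completed on that day."
```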

Conclusion

As a methodology, DataOps can eliminate barriers between data producers and data consumers, resulting in faster data development cycles and higher-quality data. Using dbt, data teams and stakeholders can make the DataOps culture part of their daily workflows, building a data framework that combines speed and governance with distributed ownership.

See dbt Cloud in action and learn how it can support your DataOps journey—Book your personalized demo.

Last modified on: Oct 16, 2024
