
Scaling dbt models for CDC on large databases from Coalesce 2023

Santona Tuli, director of data at Upsolver, explains how her team used change data capture (CDC) to scale data ingestion from production databases.

"Data quality is a journey."

- Santona Tuli, director of data at Upsolver

In her Coalesce 2023 talk, Santona explains how change data capture (CDC) helped her team scale data ingestion. She also explains the importance of shifting left with dbt models and the challenges of ingesting data from production databases.

Scalability in data management involves robustness, developer experience, and adapting to changes in data volume

Santona elaborates on the concept of scale in data management: it's not only about handling an increasing amount of data but also about robustness and developer experience. She emphasizes that a scalable system should adjust to changes in data volume, not just be optimized for maximum capacity.

She states, "So, to me, scale is synonymous with robustness... robust architectures will scale up when needed but will not waste resources when there isn't a need." This perspective underscores the importance of resource optimization in creating scalable systems.

Santona also discusses the concept of “scaling developer experience.” She suggests that developer productivity and satisfaction improve when tools are enjoyable to use and extend developers' ownership over the workflow. Santona states, "What makes me scale the best is when I can extend my [enjoyment] of my work to more."

Data quality is a journey and requires a consistent effort throughout the data workflow

Santona emphasizes that achieving high-quality data is a journey rather than a one-time task, requiring consistent effort, starting from the beginning and continuing through to delivery.

She explains, "Data quality is a journey. You can't just sprinkle little bits here and there and say ‘Okay, I have high-quality data products.’ You have to be consistent. You’ve got to start at the beginning and follow through to delivery."

Santona also sheds light on the issues that often arise when control over data quality is left to upstream processes, including decisions on backfilling, data freshness, ordering, deduplication, and schema changes. She argues that these decisions should ideally involve input from the analytics engineers who work with the data downstream to prevent context from getting lost and to ensure a more efficient data workflow.
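One way analytics engineers can pull these decisions downstream is to make ordering and deduplication explicit in a dbt staging model rather than trusting an upstream process to have handled them. A minimal sketch, assuming a hypothetical CDC source of change events with `id`, `updated_at`, and an `op` column marking inserts, updates, and deletes:

```sql
-- models/staging/stg_orders.sql
-- Hypothetical CDC feed: one row per change event; `op` marks the
-- operation and `updated_at` gives the ordering of events.
with ranked as (
    select
        *,
        row_number() over (
            partition by id
            order by updated_at desc
        ) as rn
    from {{ source('cdc', 'orders') }}
)

select id, customer_id, status, updated_at
from ranked
where rn = 1          -- deduplicate: keep only the latest event per key
  and op != 'delete'  -- drop keys whose latest event is a delete
```

Because the logic lives in a version-controlled model, decisions like these stay visible to the people who consume the data downstream.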

Ingesting production databases via change data capture (CDC) can be scaled with dbt

Santona highlights the challenges of working with large production data and potential solutions for scaling the ingestion process. She also explains change data capture (CDC): capturing changes made at the data source and applying them to the data target. However, she points out that scalability can become an issue with production data due to its potentially massive size.
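In warehouse terms, "applying them to the data target" typically amounts to merging a batch of change events into the destination table. A generic sketch in SQL; the table and column names are illustrative, not from the talk:

```sql
-- Apply a batch of CDC events to the warehouse table.
-- staging.orders_changes is a hypothetical table of change events,
-- reduced to the latest event per key (e.g. via deduplication as above).
merge into analytics.orders as tgt
using staging.orders_changes as src
    on tgt.id = src.id
when matched and src.op = 'delete' then
    delete
when matched then
    update set
        status = src.status,
        updated_at = src.updated_at
when not matched and src.op != 'delete' then
    insert (id, status, updated_at)
    values (src.id, src.status, src.updated_at)
```

Doing this once is straightforward; the scaling problem Santona describes next is that a per-table pipeline forces you to repeat it for every table.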

She states, "CDC solutions, unfortunately, require you to do ‘per-table pipeline,’ which is super annoying because you have to say ‘Okay, this is a table... I have a pipeline to capture the data from there and put it into Snowflake,’ and then you have to do it again for items…so that is something that is clearly not scalable when you have hundreds or even thousands of tables."

To tackle this, she suggests expressing the CDC ingestion job itself as a dbt model, which extends the dbt developer experience to ingestion as well. She explains, "Just with a few lines of config code…I've scaled my dev experience from just transformation to ingestion as well... and most importantly, I've scaled out the ownership, so I can now work with the data from the CDC database all the way through to the delivery of my data products."
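The exact syntax depends on the adapter (Santona is describing Upsolver's dbt integration), but the idea is that an ingestion job becomes just another model in the project. A hypothetical sketch of what such a config-driven model could look like; the materialization name and options below are illustrative placeholders, not Upsolver's actual API:

```sql
-- models/ingest/prod_cdc.sql
-- Hypothetical: a custom materialization that creates and runs a CDC
-- ingestion job instead of building a table. All option names here are
-- illustrative, not a real adapter's API.
{{ config(
    materialized='copy_job',        -- ingestion materialization (hypothetical)
    source='postgres_prod_cdc',     -- CDC connection to the production database
    tables='public.*',              -- one job covers every matching table,
                                    -- avoiding a per-table pipeline
    target_schema='raw'
) }}

-- The body can stay trivial; in this sketch the materialization
-- above does the work of creating the ingestion job.
select 1 as placeholder
```

Because ingestion now lives in the same dbt project as transformation, the same team can own the data from the source database through to the final data products.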

Insights surfaced

  • Scaling is not just about handling increased data volume. It also involves being robust enough to handle schema evolution, data type changes, and data volume variability
  • Developer enjoyment is crucial in scaling. Tools that developers enjoy using can make them faster and happier, which can help scale the data engineering process
  • Data quality is a journey that starts from the beginning and follows through to delivery
  • Production data can get quite large, but it's essential for understanding business operations
  • Change data capture (CDC) with a tool or engine that has quality, robustness, and scaling built into it is an effective solution for ingesting data from production databases