Reproducible and transparent data pipelines with dbt from Coalesce 2023
"I want to live in a world where we're not just prioritizing reproducibility for our most important models, but it's baked in, and we just do it for everything."
Christine Dixon, Lead Analyst at Mantel Group, shares her insights on the importance of data transparency and how to navigate the challenges of achieving reproducibility.
Transparency and reproducibility are crucial in data analytics
Drawing from her experience as an academic scientist during the reproducibility crisis, Christine highlights the potential risks of non-transparent practices in the data world. She explains, "In order to be transparent, your data needs to be accessible... it needs to be explained... and it needs to be reproducible. If your data is not reproducible, [and] if your analyses are not reproducible, they're probably not transparent." She acknowledges that, while many business decisions are still based on massive, complex spreadsheets, this method often leads to unreproducible data pipelines.
Christine argues that the solution lies in functional data engineering, which assumes that "everything is deterministic. Your data plus your code gives the output." By keeping data in its original state, so that source data is preserved and transformations are deterministic, it becomes easier to reproduce and defend data.
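In dbt terms, this pattern might look like a staging model that reads only from a preserved, immutable source and applies purely deterministic transformations. A minimal sketch — the source, table, and column names here are hypothetical, not from the talk:

```sql
-- models/staging/stg_orders.sql
-- Deterministic: the same source rows always yield the same output rows.
-- No current_timestamp, random(), or other environment-dependent logic.

select
    order_id,
    customer_id,
    lower(status)            as status,      -- normalize casing
    cast(ordered_at as date) as order_date   -- stable type cast
from {{ source('shop', 'raw_orders') }}
```

Because the raw table is never mutated and the logic contains no non-deterministic functions, dropping and rerunning the model reproduces the same output from the same inputs.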
Transparency in data analytics is affected by various factors
Christine expands on the aspects that contribute to transparency in data, including accessibility, explanation, and reproducibility. She notes that these elements carry different weight under different circumstances, and that understanding the context of the data is essential for effective transparency.
She also highlights the need for a narrative or business logic to accompany the data: "Having some narrative, and some business logic, and explanations that go along with your data is also part of the transparency process."
Christine explains that the accessibility of data is crucial for transparency. Although this might not necessarily mean making the data publicly available, it should be accessible to someone—be it a boss, a CEO, a colleague, or even oneself in the future. This, she points out, helps maintain transparency in the data.
Technologies and best practices can help maintain transparency in data
"In the data world, we want to see more transparency. We want to be able to defend our data better, and we need tools to make that easy."
Christine shares that advancements in dbt have made maintaining defensible data simpler. Nevertheless, traps and complexities remain that can make transparency hard to sustain, especially as projects grow.
She explains, "dbt really helps get us a long way down this pathway to make things reproducible. We're going to start with running the data we've already got, and then we move to just updating the new stuff that's come in."
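The "just updating the new stuff that's come in" pattern corresponds to dbt's incremental materialization. A minimal sketch, assuming a hypothetical events source with `event_id` and `event_ts` columns:

```sql
-- models/marts/fct_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_ts
from {{ source('app', 'raw_events') }}

{% if is_incremental() %}
  -- On incremental runs, process only rows newer than what is already built
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

The first run builds the full table; subsequent runs append only new rows. When the transformation logic itself changes, `dbt run --full-refresh` rebuilds the table from scratch — the drop-and-rerun step that can become painful as projects grow.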
However, she also cautions that as the business and its data evolve, code changes become necessary. This can complicate the transparency of data, making it harder to drop and rerun datasets. To overcome these challenges, Christine suggests that our tools need to evolve further to make reproducibility and transparency easier and more automatic.
Christine's key insights
- Transparency in data is crucial. It involves making the data accessible, explained, and reproducible
- Reproducibility is key to transparency. If data or analyses are not reproducible, they are probably not transparent
- dbt makes defensible data simpler, but challenges and traps can make transparency hard to maintain as projects grow
- There’s a tension between maintaining every output and maintaining reproducibility