Session Recap: Bringing Analysts into the Data Transformation Workflow
In the dbt Live: Expert Series, Solution Architects from dbt Labs address audience questions and cover topics like how to design a deployment workflow or refactor stored procedures for dbt. Sign up to join future sessions!
In the latest session, Mike Burke discussed how and why to engage analysts – and other data users outside of data engineering – in the data transformation process.
- Watch a replay of Mike’s session
- Why and how to make the data transformation process more accessible to analysts
- Centralize, distribute, or pod? Structure your data team for better collaboration across functions
- Q&A with attendees
Why Get Data Analysts Involved?
It’s Monday morning, and a key business metric has suddenly spiked on a dashboard. The Slack messages start to mount: Did something break? What went wrong, and where? Which other dashboards and data products are affected, and who needs to step in to troubleshoot?
If analysts can easily find and interpret the data lineage for this dashboard, they can take steps to research the root cause, communicate to the broader team, and localize (or even resolve) the problem before pulling in a data engineer.
Imagine that an analyst spots the spiking metric in a `yearly_part_rollup` dashboard, and can proceed to do the following:
- Look up the dashboard in question in a shared documentation website for your dbt project.
- See the owner and description defined for that dashboard, which help answer “Who do I need to notify about this issue?”
- Find a graph visualization of the data sources that feed into the `yearly_part_rollup` dashboard. This helps answer questions like “What does this dashboard rely on?” and “What other dashboards or data models might be affected by this data issue?”
- See the owner, tests, and expectations defined for the source data, and the timestamp of its last update. This helps answer questions like “Did the data source refresh on the expected schedule? If not, who do I contact to troubleshoot why it went stale?”
You can make your dbt project easier for teammates to navigate with the following practices:
- Declare sources, to define data loaded into your warehouse, and make those points of origin visible in the lineage graph dbt generates for your project.
- Build modular data models, so each transformation can be easier to interpret, test, and reuse.
- Use consistent naming conventions to help teammates understand what types of transformations live where.
- Define tests, to quality-check data before it reaches your dashboards and stakeholders.
- Declare exposures, to define downstream uses of your dbt-transformed data (such as dashboards or ML pipelines that read the transformed data). Sample use case: This can help with identifying all the dashboards that would be impacted if you deprecate or change a specific source.
- Add status tiles to your dashboards, to signal the freshness and quality of the underlying data.
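Several of the practices above live in a dbt project’s YAML files. As a minimal sketch (the source name `erp`, table `raw_parts`, model `yearly_part_rollup`, and owner details are hypothetical), a single schema file can declare a source with a freshness check, tests on a model, and an exposure for the downstream dashboard:

```yaml
version: 2

sources:
  - name: erp            # hypothetical source system
    schema: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 24, period: hour}
    tables:
      - name: raw_parts

models:
  - name: yearly_part_rollup
    description: "Yearly rollup of part-level activity."
    columns:
      - name: part_id
        tests:
          - unique
          - not_null

exposures:
  - name: yearly_part_rollup_dashboard
    type: dashboard
    owner:
      name: Analytics Team          # placeholder owner
      email: analytics@example.com
    depends_on:
      - ref('yearly_part_rollup')
```

With this in place, `dbt source freshness` flags a stale source, `dbt test` quality-checks the model, and the exposure appears in the lineage graph so an analyst can trace the dashboard back to its inputs.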
What’s the Most Efficient Way to Organize My Data Team?
Next, Mike touched on a question he hears frequently on the job (also a common topic of debate in the dbt Slack community):
What are pros and cons you’ve seen with centralized, embedded, and hybrid data team structures, and where should analytics engineering fit in?
While the “right” answer will depend largely on the size, culture, and data needs of your wider organization, Mike outlined some considerations for each approach below.
Centralized
In this approach, data engineers and analysts sit in the same team, but work at a remove from business units and data end-users.
- Data engineers and analysts work closely together, can learn from each other, and are more likely to stay aligned on priorities.
- The centralized team can maintain a high degree of control & governance over data produced for stakeholders.
- Difficult to scale for large or growing organizations.
- Analysts have a steeper learning curve for the business context of every piece of analysis they need to handle.
- Business units may feel disconnected from the data team – unclear on how long things should take and why, and less invested in improving the quality of the data they produce.
Decentralized / Embedded
This typically consists of a core engineering or analytics engineering team, and analysts embedded in each business unit.
- Business units may feel more engaged with data, with a dedicated analyst immersed in their use cases and routines.
- Embedded analysts have accumulated context on their unit’s KPI definitions, priorities, and other reference points – all of which may help them turn analyses around more quickly.
- Data analysts and data engineers collaborate less closely, share knowledge less regularly, and may find their goals misaligned.
- Analysts may feel isolated within embedded teams, with fewer opportunities to interact with peers in their own discipline.
- Analyst managers may have a harder time evaluating job performance and guiding career growth for direct reports distributed across different teams.
Hybrid / Pod-Based
In this approach, each business unit is aligned with a domain-focused data “pod”. Each pod is led by someone (often an analyst) with relevant domain expertise, who scopes data projects and scales the team (with other analysts or data engineers) as needed.
- Stronger collaboration and knowledge sharing between business units and data pods.
- Pod leaders have a sense of ownership and the context to make informed decisions.
- Centralized pod structure allows analysts and engineers to stay aligned on goals, while maintaining business context & focus for individual projects.
For a deeper dive, Mike recommends the Coalesce talk and blog post that David Murray of Snaptravel has dedicated to this topic.
After the discussion, Mike and fellow Solutions Architect Victoria Perez Mola answered a range of attendee questions.
To hear all of the Q&A, replay the video (starting ~32:00) and visit the #events-dbt-live-expert-series Slack channel to see topics raised in the chat.
A sample of questions:
- (Pre-submitted) - How do I send alerts on my dbt jobs to Slack? (Resource)
- (33:00) - As a SaaS company, how can we structure a single dbt project to support multiple customer datasets? Should we avoid this altogether? (Resource)
- (39:30) - How do I optimize my incremental data updates? (Resource)
- Mike also recommends checking out this talk from Tier Mobility on their incremental strategy.
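As background for the incremental-updates question, the common dbt pattern is an incremental model that only processes rows newer than what the target table already holds. This is a sketch, not the approach from the talk; the staging model `stg_parts` and its `updated_at` column are hypothetical:

```sql
-- models/yearly_part_rollup.sql — hypothetical incremental model
{{ config(
    materialized='incremental',
    unique_key=['part_id', 'rollup_year']
) }}

select
    part_id,
    date_trunc('year', updated_at) as rollup_year,
    count(*) as event_count,
    max(updated_at) as updated_at
from {{ ref('stg_parts') }}

{% if is_incremental() %}
-- on incremental runs, only scan rows newer than the target table's high-water mark
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

group by 1, 2
```

On the first run dbt builds the full table; on subsequent runs the `is_incremental()` filter limits the scan to new rows, which are upserted on the unique key.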
Last modified on: Mar 10, 2023