
Data as Engineering

Raazia is a data engineer at GetUpside. She makes data dreams come true for data consumers across the company.

GetUpside is a platform that connects users with local merchants to help communities thrive. Our data team is small, but our footprint is large and rapidly growing, as we support $1B in local commerce.

In our talk, we’d like to share a deep dive into how we do it and why it works. In particular, we’ll talk about our overall philosophy, then dig into our development workflows and CI/CD pipelines. Finally we will cover how we scaled all that to put data engineering tools into the hands of our entire GetUpside team.

Follow along in the slides here.

Browse this talk’s Slack archives

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript

Elize Papineau: [00:00:00] Hello, and thank you for joining us at Coalesce. My name is Elize Papineau. I’m a senior analytics engineer at dbt Labs, and I’ll be the host for this session. The title of this session is Data as Engineering and we’ll be joined by Raazia Ali, who is a senior software engineer at GetUpside.

Before we get started, I want everyone to take a moment to join us in Slack, please. You can join us in the coalesce-data-as-engineering channel of the dbt community Slack. If you’re not a part of the chat you can join right now. Please visit community.getdbt.com and search for the coalesce-data-as-engineering channel.

Once you’re in the chat, please feel free to make comments and react to the session. Post [00:01:00] any questions you have for the speaker or other attendees as the session proceeds. After the session, the speaker will hang out in the Slack channel to answer your questions, but we encourage you to interact in this channel throughout the presentation.

Let’s get started. Over to you, Raazia.

Raazia Ali: Thank you so much, Elize. Hi, my name is Raazia Ali, and I’m a data engineer at GetUpside. Today I am going to be talking about how a mighty team of four engineers supported GetUpside in contributing $2 billion in local commerce using engineering best practices.

We like to call it data as engineering. So this is our story about things that work for us, lessons that we learned along the way, and the concrete results that we produced as an outcome. We will walk through exactly how our data team of [00:02:00] four has gone from being the data experts to enabling all team members to use data to make better business decisions.

But first let’s start with who we are. GetUpside is bringing online personalization and attribution tools to the offline world to make brick and mortar commerce more efficient. Our users experience us as a free mobile app that helps them earn cashback on their everyday items. So how is this making commerce more efficient?

So we partnered with over 30,000 local businesses to provide offers to users in their area. We ingest billions of records, representing millions of users and billions of dollars of commerce, and personalize the user experience through cashback offers. We deliver hundreds of millions of dollars as cashback to users and as incremental profit [00:03:00] to merchants.

And on top of that, we also give 1% of our revenue to sustainability efforts. So obviously data plays a crucial role at every junction of our business.

So it’s over 2 billion of gross merchandise value, GMV, running through our platform every year. We are working on a massive problem set and growing exponentially. Of course, the company’s data needs are growing much faster than our data team could ever grow. So by mid 2020, we had 160 team members and only two data engineers.

And by the end of 2020, we had 210 team members and only four data engineers. And on top of that, our data team was not able to keep up with the abundance of inbound requests, despite their best efforts. Requests would come in from [00:04:00] external customers and internal teams. We would have to prioritize all those requests against each other.

Our to-do list kept growing. Our response time stayed the same, and we were getting in the way of our own growth. So obviously that wasn’t working and something had to change. So we put our heads together and decided to change our mindsets. We were looking for problem-solving inspiration, which we found within GetUpside in advice from our co-founder and VP of engineering, Rick: avoid the service mentality.

Raazia Ali: So we were very inspired by a talk by Emilie and Taylor at last year’s dbt Coalesce about how to run your data team as a product team. We did some creative thinking and broke the problem down to the basics. At the heart of every dbt model is a SQL query. Data consumers have excellent SQL [00:05:00] skills.

So if we could somehow reduce the toolchain overhead of learning, developing, and productionalizing, we could potentially enable data consumers to build their own data pipelines using dbt. But the question was, how do we achieve that?

The solution we found comes down to the three points of the self-reinforcing cycle. Find willing partners across the company who were on board with the idea of self service if that meant getting their data needs satisfied faster. Acquire excellence in data ops to reduce toolchain overhead. And provide multidimensional learning and support resources to be able to scale faster by having patterns and answers available for the most common questions.

In other words, analysts are excellent data chefs. All we are doing here is showing them how to use the appliances. Once [00:06:00] learned, they can cook all they want to their heart’s desire or data needs. Internally, we are constantly learning from our experiences and improving our processes. Let’s start with the first point of the cycle, building partnerships.

[00:06:19] Explore the possibility of an engagement

Raazia Ali: This is the enablement model we use to engage with the teams. It allows us to work in partnership to get them what they need and to teach them how to self-serve in the field.

The exploration of an engagement can be team initiated, where they may have current data-related blockers or a new project starting off, or it can be data engineering initiated. In that case, we explore whether we can help a team up-level its analytics capabilities using the enablement model. Next is to establish and [00:07:00] communicate the mechanics of a quarter-long partnership with the expectation of training our data X contributor. On the data engineering side, two team members are identified as primary and secondary leads for implementation, training, and coverage.

[00:07:20] Kick off the engagement by building models and training them

Raazia Ali: The engagement is kicked off by the data engineering team building data pipelines for the team. During this phase, we understand their data needs while educating them to understand their own data needs. We also establish a weekly rhythm of business that includes a sprint commit by the data engineering team followed by a weekly stakeholder review meeting.

Next, we enter the training phase, where we train contributors from the team to build their own models using hands-on training, consultation, and co-development. This also includes [00:08:00] reinterpreting the weekly rhythm of business. Now both the data engineering team and the enablement team commit toward a weekly sprint goal.

Weekly stakeholder review meetings are repurposed as co-development sessions as needed. In the self-serve phase, the data contributors from the team are mostly trained and are able to self-serve through documentation, training videos, and example PRs. The data engineering team transitions to a support role via office hours, design consultation, and Slack channels.

[00:08:38] Wrap up the engagement

Raazia Ali: An engagement is officially wrapped up once the team has either built enough data marting skills to be self-sufficient or the team’s data needs have been satisfied, for now.

The next step is to achieve excellence in data ops to reduce toolchain overhead. This is crucial [00:09:00] to reduce the entry barrier for non-engineers to get started with dbt modeling.

Raazia Ali: All four of us in the data engineering team are software engineers by training who moved into the data space. So our approaches are rooted in engineering best practices. This is where data as engineering comes in. At the heart of our toolset are dbt and Snowflake, around which our whole data ops ecosystem is built.

Our primary mechanism to load raw data into Snowflake is Fivetran. We use PagerDuty for alerts about data load issues in Fivetran and for dbt Cloud build issues. We have a weekly on-call rotation program in PagerDuty to have one team member on point to solve data and build issues and [00:10:00] to shield the rest of the team from disruptions.

All our data modeling is done in dbt. For anything and everything that dbt does not control, we use Terraform to define our infrastructure as code, like Snowflake administration, roles, and permissions. And of course, for reproducibility, we dockerized our Terraform and dbt interactions.

Having a robust data ops setup has enabled us to set up reliable and fully separated prod, dev, and CI/CD environments. We like to call them the cycle of dbt.

So analytics is our product. It is structured according to the dbt-recommended project setup. That is, starting from the raw database, each model has three checkpoints: sources, staging, and marts. We have [00:11:00] purpose-built marts by application or usage, but our analytics.analytics mart is our general purpose analysis layer.

All other purpose-built marts and data products, including Looker, source from this analytics.analytics mart. On the opposite end of this cycle are the dbt user databases. These are the developers’ and researchers’ playground. They have access to almost all of the product data in their dedicated databases to experiment, modify, and research, with peace of mind for them and us that their experimentation cannot impact the product data in any way.

While doing actual real work on real data, their experimentation can stay localized forever. Or, if something needs to make it back into prod, they can open a GitHub [00:12:00] PR. Our PR pipeline is integrated with dbt Cloud to test those changes.

Every push on a PR triggers a dbt Cloud build in a dedicated CI/CD database with schemas. The data team owns the PR approval, merge, and release process. The last element of the cycle is the clone database. Every time the main release build is triggered in dbt Cloud, as the last step it clones the prod database. The user databases clone off of the analytics clone instead of directly from prod analytics. This provides an additional layer of separation between the prod and user environments. Notice that the sandboxes are on the opposite side of the cycle from the prod database. Anything going in or coming out of either does not touch the other, achieving complete [00:13:00] separation of environments.
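Editor’s sketch: the two-hop clone described above can be illustrated with Snowflake’s `CREATE OR REPLACE DATABASE ... CLONE` statement. This is a minimal sketch, not GetUpside’s actual release step. The database names (`ANALYTICS`, `ANALYTICS_CLONE`, `DBT_USER`) and the `clone_statements` helper are assumptions made for illustration, and the code only builds the SQL text without connecting to Snowflake.

```python
def clone_statements(prod_db: str, clone_db: str, user_db: str) -> list[str]:
    """Return Snowflake SQL for the two-hop clone: prod -> clone -> user sandbox.

    All database names passed in are hypothetical placeholders.
    """
    return [
        # Last step of the release build: snapshot prod into the clone database.
        f"CREATE OR REPLACE DATABASE {clone_db} CLONE {prod_db};",
        # A user sandbox refresh reads from the clone, never from prod directly.
        f"CREATE OR REPLACE DATABASE {user_db} CLONE {clone_db};",
    ]


if __name__ == "__main__":
    for statement in clone_statements("ANALYTICS", "ANALYTICS_CLONE", "DBT_USER"):
        print(statement)
```

Because the sandbox is cloned from the clone database, a sandbox refresh never issues a statement against the prod database itself, which is the extra layer of separation the talk describes.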

Here is a visual representation of how the analytics.analytics schema goes through the whole cycle of dbt in Snowflake, starting from the prod analytics database, to be copied over to the analytics clone at release time. Then it gets pulled into a user’s local sandbox from there. The PR process takes it to the dbt Cloud database and eventually back into the analytics prod. All our builds are run in dbt Cloud. We have four types of builds. Our CI/CD and release builds are part of our cycle of dbt, integrated by GitHub Actions. Besides that, we have scheduled data refresh and data test builds to ensure freshness and correctness of our analytics. [00:14:00]
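Editor’s sketch: one way to picture the separation of environments is as a mapping from build type to the single database that build is allowed to write to. The target names (`prod`, `ci`, `dev`), the database names, and the `write_database` helper below are all hypothetical; the talk does not spell out the actual naming scheme.

```python
# Hypothetical mapping of shared build targets to the database each one writes to.
SHARED_WRITE_DATABASES = {
    "prod": "ANALYTICS",    # release and scheduled refresh builds
    "ci": "ANALYTICS_CI",   # builds triggered by pushes to a pull request
}


def write_database(target: str, user: str = "") -> str:
    """Return the only database a build for this target may write to."""
    if target == "dev":
        if not user:
            raise ValueError("the dev target needs a user name")
        # Each contributor gets a dedicated sandbox database.
        return f"DBT_{user.upper()}"
    if target not in SHARED_WRITE_DATABASES:
        raise ValueError(f"unknown target: {target}")
    return SHARED_WRITE_DATABASES[target]


if __name__ == "__main__":
    print(write_database("prod"))
    print(write_database("ci"))
    print(write_database("dev", user="analyst"))
```

With a mapping like this, the three environments never share a write destination, so an experiment in a sandbox or a failing PR build cannot change what the prod marts contain.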

[00:14:02] Data ops safeguards

Raazia Ali: One last piece of our data ops setup is our roles and permissions structure. All data team members by default are Snowflake accountadmin and sysadmin. All non-team members have standard read-only access to our analytics mart. The basic analyst access can be upgraded with custom team or schema access to specific marts.

Besides that, we have application roles granting required access to any specific application. When we decided to switch to the enablement model, we were faced with an interesting problem. Our standard non-member roles were too restrictive by being read-only, but our data engineering role was too permissive with admin privileges.

In order to open up dbt data modeling to non-team members, they needed to [00:15:00] be able to read and write all the way from raw to the analytics layer. That’s where we developed the contributor role. In order to settle on the right level of access, we decided to dogfood our own permissions. We moved admin roles to attached roles, where they have to be explicitly assumed to upgrade, and switched to the contributor role by default. We kept on adding privileges to the contributor role as we kept on hitting roadblocks, till we reached the right mix of permissions. So the contributor role is a power user now, with read access to the raw sources and staging schemas, as well as full privileges to operate on their own dbt user database.
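Editor’s sketch: the shape of the contributor role (read on raw and staging, full control of the user’s own sandbox) can be written out as Snowflake `GRANT` statements. The talk says the team manages roles with Terraform, so the plain SQL below is only an illustration. The role name, the `RAW` and `ANALYTICS.STAGING` object names, and the `contributor_grants` helper are assumptions, and the list is not an exhaustive or production-ready grant set.

```python
def contributor_grants(role: str, user_db: str) -> list[str]:
    """Build Snowflake GRANT statements for a contributor-style role.

    Object names (RAW, ANALYTICS.STAGING) are hypothetical placeholders.
    """
    read_only = [
        # Read access to everything in the raw database.
        f"GRANT USAGE ON DATABASE RAW TO ROLE {role};",
        f"GRANT USAGE ON ALL SCHEMAS IN DATABASE RAW TO ROLE {role};",
        f"GRANT SELECT ON ALL TABLES IN DATABASE RAW TO ROLE {role};",
        # Read access to the staging schema of the analytics database.
        f"GRANT USAGE ON DATABASE ANALYTICS TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA ANALYTICS.STAGING TO ROLE {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.STAGING TO ROLE {role};",
    ]
    full_control = [
        # Database-level privileges on the contributor's own sandbox only.
        f"GRANT ALL PRIVILEGES ON DATABASE {user_db} TO ROLE {role};",
    ]
    return read_only + full_control


if __name__ == "__main__":
    for statement in contributor_grants("CONTRIBUTOR", "DBT_USER"):
        print(statement)
```

The point of the split is that nothing outside the sandbox database receives more than `USAGE` or `SELECT`, which sits between the read-only analyst access and the admin roles described above.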

[00:15:53] Self-service resources

Raazia Ali: Last but not least, the pillar that has enabled us to grow our enablement model [00:16:00] non-linearly is the provision of self-service resources.

One of the things that we recognize and acknowledge internally and to our contributors is that we are enabling citizen data engineers on a hardcore engineering toolchain. Without providing the resources for the contributors to refer back to in their own time and pace, excellence in data ops and partnership with teams alone will not be as fruitful.

Therefore, we strive to create a culture of support around our enablement model through training, documentation, and live support. Our documentation is structured around very simple questions our reader would ask themselves: are they trying to get data into or out of Snowflake, or work with data within Snowflake?

Overall, as a team, we have adopted the [00:17:00] handbook-first methodology, which is really inspired by GitLab’s handbook-first approach. We have committed as a team to continue to contribute to the handbook and to answer questions through documentation. For example, if a question comes to us through any of our support channels or enablement, we strive to respond by pointing to a link in our documentation.

If the answer does not exist, we add it first to the documentation and then point to it. Second, we have taken to presenting via our handbook. We do not create slide decks. Instead, we update or augment our documentation pages and present from there.

Besides documentation, we provide training resources to our [00:18:00] contributors. One of the most useful resources is dbt’s very own Learn the Fundamentals of Analytics Engineering with dbt course to get new contributors started with dbt development. Since we were all working remotely, independently, and communicating over Google Meet, we started recording our support and pairing sessions.

The idea is that if one of the contributors has a question, others would as well. Our most popular tutorial is a video of setting up your dev environment in DataGrip, which was recorded on a whim while onboarding a new team member. To increase data engineering literacy, we leverage the existing knowledge sharing platforms, such as seminar series, where team members take turns to present an ordered sequence of topics, each building on top of the previous one. [00:19:00] These sessions are recorded and shared in our handbook to be used later as self-service resources. All our efforts were rewarded when one of the more recent contributors trained themselves when pointed to our Getting Started with dbt Development page. And they said our documentation was very straightforward. It took us at least six months of continuously improving that page to get to a point where it is now intuitive to a new user.

We now sometimes see fully fleshed out data modeling PRs by new contributors who have not gone through our enablement process and who, just by using our self-service resources, were able to self-train and produce quality data pipelines.

Besides documentation and training, we have live support available through our Slack channels. [00:20:00] Besides seeking help directly from the data team, the users have analysts and developer community channels available where their data or their data modeling questions can be answered by fellow users. The data team also holds weekly office hours where all team members are present to answer any questions by the contributors.

This unstructured time can be used for co-development, troubleshooting, debugging, general support, or even ideation. And of course the data team members are always willing to jump on a call to help troubleshoot or answer any questions.

[00:20:47] Results: Process impact

Raazia Ali: So with all of these efforts and processes, how has that impacted the overall process? Our efforts have enabled us to truly transition into an [00:21:00] enablement team. Data literacy and data pipeline development have become common skills across GetUpside. Not only have the contributions from the citizen data engineers increased manyfold, but multi-dimensional data insights have also become accessible to everyone through Looker. Internally, our processes have improved manyfold. The turnaround time for new data products has been shortened from delivering a product in multiple sprints to delivering multiple products per sprint. The data team is not at risk of being a blocker for any data process anymore.

[00:21:46] Results: Business impact

Raazia Ali: In just three quarters, our enablement model has had a profound impact on the business, as well as business processes, unlocking massive business growth, customer retention, and on top of [00:22:00] that, satisfied internal and external stakeholders.

Here’s the mighty team of four that has accomplished all this in only three quarters while setting up a solid foundation for future growth and scalability. So if you know any data or analytics engineers, please send them our way. We are hiring at GetUpside.

So what’s next for data engineering? We are now looking at solving the next generation of data and user scaling challenges while maintaining data integrity and security. More recently, we have had some awesome data engineers join our team who are constantly looking at ways to eliminate bottlenecks in our global dbt DAG to further optimize our analytics model for speed and scale. [00:23:00]

Having mastered data marting skills, many of our contributors are now aspiring to build and own their data ingestion pipelines. This is an active area of research for us: how we can safely upgrade the contributor role to enable ingestion into the raw database while maintaining oversight, as well as ensuring data quality of the future models that will be built on top.

We are now extending our handbook-first methodology over to our highly underutilized resource of dbt docs. We believe that data documentation should live as close to the data source as possible. Therefore, we are encouraging teams all across GetUpside to contribute their knowledge and experience about common data models and their usage directly to the dbt docs for everyone’s benefit. [00:24:00]

Thank you so much for listening. I look forward to continuing the conversation in the chat. Thank you.

Last modified on: Apr 19, 2022
