Version control with Git
Remember that in analytics engineering, we’re mostly borrowing time-tested concepts from software engineering.
Version control, and specifically the Git workflow, might be the prime example of that. You truly can’t write production-grade code without it.
It allows teams to collaborate on the same code repository in “branches,” to avoid stepping on each other’s toes.
When the time is right, work gets reviewed + merged into the “main” branch and into production.
This entire commit history is visible to the team, so you can always audit changes over time.
Let’s take a quick romp through the Git workflow that we use & teach internally at Fishtown Analytics.
Git definitions
Before we dive in, a couple quick notes on definitions (see more about these in our internal Git guide):
Branch: A version of the code specific to a feature, fix or refactor that you’re working on.
Commit: An (ideally small!) code update to a branch that you’re working on. We recommend committing early and often.
Pull request: The process for reviewing code on your branch before it’s merged into production. We recommend following a pull request template for code changes (see our internal pull request template here).
Now, let’s get into the Git flow, and how it’s a must for producing high-quality analytics code.
The basic Git flow
This flow assumes that you’re working on a remote repository that lives on GitHub, GitLab, Bitbucket, or another cloud-based Git provider.
Step 1 - Clone the original codebase
Before you can do any local development (whether that’s on your own machine, or in a cloud-based IDE like dbt Cloud), you’ll want to clone a copy of the repository you’re working on.
On the command line, you’d use the git clone command to do this, and the git pull command to refresh your clone over time.
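As a rough sketch of Step 1, the commands below clone a repository and then refresh it. So that the example runs anywhere, a throwaway local bare repository stands in for the hosted one; the directory names (seed, remote.git, analytics) are made up for illustration, and with a real provider you’d pass its repository URL to git clone instead.

```shell
set -eu
scratch="$(mktemp -d)"

# Stand-in for the hosted repo: a local bare repository with one commit on main.
# (git init -b needs Git 2.28 or newer.)
git init -q -b main "$scratch/seed"
git -C "$scratch/seed" -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "Initial commit"
git clone -q --bare "$scratch/seed" "$scratch/remote.git"

# Step 1: clone the repository, then move into your local copy.
git clone -q "$scratch/remote.git" "$scratch/analytics"
cd "$scratch/analytics"

# Later on, refresh the clone with whatever has landed on the remote.
# --ff-only refuses to create a surprise merge if histories have diverged.
git pull -q --ff-only
git log --oneline
```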
Step 2 - Create your development branch
Rather than work on the main production branch, you’ll want to create a safe space for your own development work.
Nothing you do here is etched in stone, so play around with the code as much as you like.
On the command line, you’d use the git branch command to create the new branch, then git checkout (or git switch) to move your local clone of the repo onto it. Running git checkout -b does both in one step.
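Here’s a minimal sketch of Step 2 in a scratch repository. The branch name feature/orders-model is a hypothetical example; use whatever naming convention your team prefers.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"

# Scratch repository with a single commit on main (git init -b needs Git 2.28+).
git init -q -b main
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "Initial commit"

# Step 2: create a development branch and switch onto it in one step.
# On newer versions of Git, "git switch -c feature/orders-model" is equivalent.
git checkout -q -b feature/orders-model

# Print the branch you are currently on.
git rev-parse --abbrev-ref HEAD
```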
Step 3 - Stage your updates to files
When you’ve made file changes that you want to commit to your branch, first you must stage them, which marks exactly which changes will be packaged into your next commit.
On the command line, you’d do this with the git add command (git stage is a synonym). Your branch remains unchanged until you stage your changes, then commit them in the next step.
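To see staging in action, here’s a small sketch in a scratch repository. The file names (models/orders.sql, notes.txt) are hypothetical. Only the file passed to git add is staged; the other stays out of the next commit.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main   # needs Git 2.28+

# Two new files: one we want in the next commit, one we don't.
mkdir models
echo "select 1 as id" > models/orders.sql
echo "scratch notes" > notes.txt

# Step 3: stage only the model file.
git add models/orders.sql

# In the short status, "A" marks a staged new file and "??" an untracked one.
git status --short
```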
Step 4 - Commit your staged changes to the local repository
Once you’re happy with your staged changes, it’s time to commit them to your branch.
On the command line, you’d run this with the git commit command.
To make this commit visible to the rest of your team, you’ll have to push it up to the remote repository in the next step.
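A minimal sketch of Step 4, again in a scratch repository. The file name, commit message, and author details are placeholders; on your own machine you’d normally have your name and email configured already.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main   # needs Git 2.28+

# Git needs an author identity to record a commit (placeholder values here).
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo "select 1 as id" > orders.sql
git add orders.sql

# Step 4: commit the staged change with a short, descriptive message.
git commit -q -m "Add orders model"

# The commit now exists in your local history only.
git log --oneline
```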
Step 5 - Push your changes to the remote repo + open a pull request
Pushing your commits allows anyone else on your team to pull down your code changes + test them out.
On the command line, you’d run this with the git push command.
If you’re ready to merge your changes into production, you can now open a pull request to have your code reviewed by someone else on your team.
If you want to collaborate without merging, simply push your branch to the remote repository, then ask your colleague to run a pull to retrieve a copy of your branch.
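The sketch below walks through Step 5 under the assumption that a local bare repository stands in for GitHub, GitLab, or Bitbucket; all directory and branch names are made up for illustration. Opening the pull request itself happens in your Git provider’s web interface, so it isn’t shown here.

```shell
set -eu
scratch="$(mktemp -d)"

# Stand-in remote with one commit on main (git init -b needs Git 2.28+).
git init -q -b main "$scratch/seed"
git -C "$scratch/seed" -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "Initial commit"
git clone -q --bare "$scratch/seed" "$scratch/remote.git"

# Your clone, with one commit on a development branch.
git clone -q "$scratch/remote.git" "$scratch/analytics"
cd "$scratch/analytics"
git config user.name "Example Analyst"
git config user.email "analyst@example.com"
git checkout -q -b feature/orders-model
echo "select 1 as id" > orders.sql
git add orders.sql
git commit -q -m "Add orders model"

# Step 5: push the branch. -u remembers the remote branch as its upstream,
# so later pushes from this branch can be a plain "git push".
git push -q -u origin feature/orders-model

# The branch is now listed on the remote.
git ls-remote --heads origin

# A colleague can now check the branch out in their own clone.
# (With an existing clone, they would run "git fetch" or "git pull" first.)
git clone -q "$scratch/remote.git" "$scratch/colleague"
git -C "$scratch/colleague" checkout -q feature/orders-model
```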
Step 6 - Merge your changes with the main codebase
Once the code reviewer approves your pull request, you’re now ready to merge it into the main branch - which generally means your changes will be going into production.
If you’re working on the command line, you’d run this with the git merge command.
But remember that clone sitting on your local machine? Since you’ve made updates to the main codebase, your local copy has become officially out of date.
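For reference, here’s roughly what a command-line merge looks like in a scratch repository. This is a sketch rather than a prescribed workflow: many teams merge through the pull request button in their Git provider instead, and the branch and file names are hypothetical.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main   # needs Git 2.28+
git config user.name "Example Analyst"
git config user.email "analyst@example.com"
git commit -q --allow-empty -m "Initial commit"

# A reviewed development branch with one commit.
git checkout -q -b feature/orders-model
echo "select 1 as id" > orders.sql
git add orders.sql
git commit -q -m "Add orders model"

# Step 6: switch back to main and merge the branch in.
# --no-ff records an explicit merge commit even when a fast-forward is possible.
git checkout -q main
git merge -q --no-ff -m "Merge feature/orders-model" feature/orders-model

git log --oneline --graph
```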
Step 7 - Pull down the latest version of the main repo
Now that your changes have been merged into the main branch, you’ll want to refresh the repository that you’re developing from.
This is also the case if your colleagues have made changes - you’ll want to periodically pull down the latest code, to avoid having too many conflicts to deal with when you eventually merge your work.
If you’re working on the command line, this is run with the git pull command.
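To illustrate Step 7, the sketch below simulates a colleague landing a change on main, then pulls it into your clone. As before, a local bare repository stands in for the hosted one, and every name here is made up for the example.

```shell
set -eu
scratch="$(mktemp -d)"

# Stand-in remote with one commit on main (git init -b needs Git 2.28+).
git init -q -b main "$scratch/seed"
git -C "$scratch/seed" -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "Initial commit"
git clone -q --bare "$scratch/seed" "$scratch/remote.git"

# Two clones: yours and a colleague's.
git clone -q "$scratch/remote.git" "$scratch/mine"
git clone -q "$scratch/remote.git" "$scratch/colleague"

# The colleague's change lands on the remote main branch.
cd "$scratch/colleague"
git config user.name "Example Colleague"
git config user.email "colleague@example.com"
echo "select 1 as id" > customers.sql
git add customers.sql
git commit -q -m "Add customers model"
git push -q origin main

# Step 7: back in your clone, pull down the latest main.
cd "$scratch/mine"
git checkout -q main
git pull -q --ff-only
git log --oneline
```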
This may seem like a convoluted process at first, but it’s really critical to keeping multiple analytics engineers on the same page.
Once you move to version-controlling your analytics code, you never go back.