What are the steps / tools in setting up a modern, SaaS-based BI infrastructure?
Note: This post, originally published in 2016, was most recently updated in December 2017.
Business intelligence tech has changed really dramatically over the past 3–4 years since the advent of cloud-based analytic databases like Amazon Redshift, Snowflake, and Google BigQuery. These databases are so powerful that they fundamentally shifted the way that the surrounding components of the analytics tech stack needed to be built. As a result, a completely new BI tech stack has emerged. Like the cloud databases that this it is built on top of, this stack is far more powerful, more usable, and less expensive than what it’s replacing.
The new BI stack typically involves the following components:
- Data modeling
- BI (query-building, visualization, and dashboarding)
Below I’ll walk through the decisions you need to make and steps you need to take to get up and running with your tech stack.
Quick note on my background: I’m the Founder and CEO of Fishtown Analytics, an an analytics consultancy serving high-growth, venture-funded startups. We have no particular affiliation with any of the commercial products I recommend in this post but they all are products that we use when delivering analytics for our clients.
Step 1: Choose and spin up your database
I’m using the term “database” here, but you could alternately use the term “data warehouse”. If you want to get geeky, you’d say “MPP analytic database” but that’s a bit of a mouthful. Exactly what an MPP analytic database is and how it’s different from a traditional RDBMS is a fascinating topic but one that I’ll have to leave for another post. For now, suffice it to say that analytic databases are a class of databases that are orders of magnitude faster than traditional relational databases at executing analytic queries over very large datasets.
Realistically, as of this writing, you’re talking about one of a small number of products: Amazon Redshift, Google BigQuery, or Snowflake. All three are great products, and in many cases each will serve you equally well. There are times when one will treat you better than others. If you have massive data volumes of things that look like events (typically user activity or machine-generated data), BigQuery will often be your best bet. If you have a more relational structure and a large number of users, Snowflake really shines. If you’re just getting started and feel like your use case is pretty typical, it’s hard to go wrong with Redshift.
If you are not an engineer, you should pull one in to help you make this decision. The tradeoffs are purely technical at this point, and require an understanding of the underlying technologies in question.
Step 2: Choose your ETL tool and get your data loading
Next, you need to figure out how to get your data into your data warehouse. There are a couple things you need to know here:
- ETL (extract, transform, and load) tools traditionally did a lot of T. In fact, transformation was often their specialty: it was really important to transform data into optimized data structures because data warehouses were much slower. Now that data warehouses are so fast, you should do as little transformation as possible prior to loading data into them. If you need to transform your data, do that once it’s sitting in your warehouse.
- In the name of all you hold sacred, do not write your own ETL code. Writing custom ETL code is a guaranteed ticket to software engineering purgatory, as you need to constantly update and fix broken code, monitor jobs, etc. This pain never goes away, but it will force every engineer that works on it to hate their lives (and potentially quit).
There are a large number of good tools available today to ETL your raw data into your data warehouse that are lightweight, do minimal transformations, and don’t require you to write any code at all. We use either Stitch or Fivetran (or both) with our clients and are huge fans of both products. Which one you use ends up being more about which one supports the specific integrations you need and which will end up costing you less. Each one has a free trial: spin up both and see which one suits your needs better.
Step 3: Choose your BI tool
You obviously need a BI tool. It should be a web-based product that connects to your warehouse, runs queries, builds charts, and lets you share them. There more options here than would be interesting to read as a part of this, but I’ll highlight ourfavorite two. If your primary users will be analysts, use Mode Analytics. If you will have both analysts and business users building reports, use Looker. To read more about our thoughts on this space and how we choose what to use with our clients, read this.
If you’ve never worked with any of the cloud-based BI tools on the market, don’t sweat it too much. Try a bunch. Choose what you like. As long as the raw data is in a data warehouse that you own, you can always throw out your BI tool and buy a new one.
Step 4: Model your data
This is the critical step that most people miss when constructing their BI stack. Remember when I said earlier that it wasn’t a good idea to transform data prior to loading it into your data warehouse? Well, that doesn’t mean it isn’t a good idea to transform your data, it’s just that your data warehouse is now so fast that it’s better to load your data in first, and then use your warehouse to transform it. This is sometimes known as the “ELT” approach.
There are a large number of reasons why you might want to transform your data, but they all boil down to this: it’s a good idea to apply a certain amount of business logic to your data before users get their hands on it. Remove erroneous records. Fix data anomalies. Translate old business rules into new ones. Generally apply business logic that will make it easier and more consistent to analyze data once an analyst actually sits down to their keyboard.
Here’s an example: if you do a meaningful amount of advertising, you’ll likely have data piping into your warehouse from Adwords, Facebook Ads, Bing, and more. Many companies run ad campaigns with 10 or more different ad channels. That data will originally be piped into your warehouse in a bunch of different schemas. In the modeling stage, you’ll want to create a single table where you can union together ad performance from all of your different channels so that you can report on it together.
Modeling your data has several big advantages:
- It makes it easier for end users to query, because the data is more structured.
- It makes analytics more consistent and more accurate, since rules are built into data models that all subsequent analyses are built upon.
- It can make long-running queries run faster, because calculations can be done prior to query time.
- It makes it easier to use multiple BI tools in conjunction with one another since your modeled data can be accessed by each tool.
My company, Fishtown Analytics, uses and maintains an open source tool called dbt (data build tool) that helps analysts and engineers model data inside their warehouses. dbt today has over 100 active companies using it, including teams from Casper, Seatgeek, and Wistia. We’d love to have you in our ever-growing Slack community!
Putting it all together
Here’s what the modern BI stack looks like when you put it all together:
Pros who have done BI work in the past expect it to take months to configure this type of tech stack, but with modern SaaS-based tools you can configure the entire stack in a day or less. Which means you can get to work doing real work that much sooner.
Of course, there are plenty of nuances involved in this process, and that’s one of the main reasons why companies frequently pull us in to help us work through it. If you’re interested in learning more about how this tech works and in helping get you off the ground, get in touch. I’d love to chat.nn
⚡️Ready to improve your analytics engineering workflow? Get started with dbt today. ⚡️
Last modified on: Nov 29, 2023