Accessing raw data

The work #

Data is piped from various sources (transactional database, customer support tool, advertising platforms, etc.) into a cloud data warehouse.

This is done using either data integration tools or custom scripting.

Note: In this guide, we’ll use the term ‘warehouse’ to generically describe the bevy of data storage + compute options on the market today (warehouses, data lakes, lake houses, or just plain old databases).

Owned by #

Analytics engineer or data engineer, depending on the complexity of configuration and your team structure.

If data is coming from commonly-used APIs (ex: Shopify, Stripe), then an analytics engineer can use off-the-shelf data loaders (ex: Fivetran, HVR, Stitch) to configure their own integrations.

A data engineer may use no-code data integration tools as well, because Engineers Shouldn’t Write ETL.

If the integrations required are not commonly supported, then a data engineer would step in to script and deploy a custom tap.

Generally, writing data extraction scripts falls outside the job description of an analytics engineer, although many AEs technically could write API integrations.

Downstream dependencies #

All data work (reporting, operational analytics, data science, etc) depends on having reliably-synced data in the warehouse.

A solid data transformation practice (covered in the next section) will identify when source data is outside of your freshness tolerance, or when data is missing altogether.

Prerequisites

Before embarking on syncing data to an analytics warehouse, it helps to have a good sense of:

  1. The required final output(s): a static report, a specific table structure, a single metric synced to a CRM? This must be clearly communicated by the end user (whether that’s you or someone else) — making assumptions at this point rarely works out well.

  2. What APIs or data sources must be synced to produce that output? Can you isolate specific API endpoints from documentation, or tables if you’re syncing from a transactional database?

  3. Will an off-the-shelf data extraction tool suffice to sync those APIs or data sources, or will custom scripting work be required?

A great analytics engineering practice requires reading the docs! Questions 2 and 3 must be answered by spelunking into API documentation, your transactional database schema, or the documentation of your data integration tool of choice.

The backstory: how this came to be #

The analytics engineering practice is made possible by two fundamental advances in tooling:

  1. Fast, inexpensive and scalable cloud analytics warehouses.
  2. Reliable and low-cost data integration tools, for ingesting data from APIs into said warehouses.

This combination created the entry point for many an analytics team.

Within an hour or two, one person could stand up an MPP cloud data warehouse and get live data flowing into it.

One can start to see how the modern data stack has assembled — one innovation (MPP cloud warehouses) begets another (data integration tools), and teams innovate from there.

If you’re someone who likes to learn how things work, the rest of this section will take you on a historical romp on how these two (warehouses & loaders) co-evolved.

If you’ve been following these innovations since AWS Redshift launched in 2013, then feel free to skip through to the transformation section.