Data transformation process: 6 steps in an ELT workflow
Data transformation is the process of taking raw data and making meaning out of it; it forms the foundation of all analytics work and represents how data practitioners create tangible value from their companies.
The data transformation process can typically be broken down into six generally-defined steps: extraction and loading, exploration, transformation, testing, documentation, and deployment. After these steps are taken, raw data takes a new form of meaningful tables that power your company’s business intelligence efforts.
We’ll use this page to describe what the data transformation process looks like for a typical ELT workflow.
Step 1: Extract and load
If your team is following an ELT workflow, where raw data sources are extracted and loaded into the data warehouse before data is transformed, you’ll need to actually accomplish those extraction and loading steps before beginning your transformation work!
In the extraction process, data is extracted from multiple data sources related to your business. The data extracted is, for the most part, data that teams eventually want to use for analytics work. Some examples of data sources can include: backend application databases, marketing platforms, email and sales CRMs, and more.
This data is typically extracted from their systems by interacting with an Application Programming Interface (API) via custom scripts, or relying on the use of an open-source or Software as a Service (SaaS) ETL tool to remove some of the technical lift.
During the loading stage, data that was extracted is loaded into the target data warehouse. Some examples of modern data warehouses include Snowflake, Amazon Redshift, and Google BigQuery. Examples of other data storage platforms include data lakes such as Databricks’s Data Lakes. Most of the SaaS applications that extract data from your data sources will also load it into your target data warehouse. Custom or in-house extraction and load processes usually require strong data engineering and technical skills.
Step 2: Data exploration
Once raw data is in your data warehouse, it’s time to get an understanding of what this data looks like. During this stage, you’ll find yourself:
- Looking at available ERDs (entity relationship diagrams) and join keys to understand how data joins itself together
- Identifying which columns are potentially missing values or have misleading column names
- Writing ad hoc queries to perform some light analysis or summary statistics on the data—how many rows are there? Are primary keys unique? Is there a primary key?
- Understanding how currency, timezone, and data types differ across data sources
There is no perfect recipe for how to explore your raw data; do what feels necessary for your data sources. If you have high confidence in the accuracy and completeness of your raw data, this step may feel less laborious than if you question the integrity of your data (which, if you’re a data practitioner, is probably your natural inclination 😉).
Step 3: Data transformation
In the actual transformation step, the raw data that has been loaded into your data warehouse, you’re familiar with how that data is structured, and you have a general plan on how to approach it—it’s finally ready for your modeling process to begin! When you first looked at this data during the data exploration stage, you may have noticed a few things about it…
- Column names may or may not be clear
- Tables are not joined to other tables
- Timestamps may be in the incorrect timezone for your reporting
- JSON is abundant and may need unnesting
- Tables may be missing primary keys
…hence the need for transformation! During the actual transformation process, data from your data sources is usually:
- Lightly transformed: Fields are cast correctly, timestamp fields’ timezones are made uniform, tables and fields are renamed appropriately, and more. In dbt, this usually happens in staging models used to create clean, uniform slates for your data.
- Heavily transformed: Business logic is added, appropriate materializations are established, data is joined together, aggregates and metrics are created, etc. In dbt, this usually happens in intermediate and mart models, creating the tables that are ultimately exposed to end business users and business intelligence (BI) tools.
Common ways to transform your data include leveraging modern technologies such as dbt to write modular and version-controlled transformations using SQL and Python. Other solutions include writing custom SQL and Python scripts that are automated by some type of scheduler or utilizing stored procedures.
Step 4: Data testing
Your data is modeled now and feels generally right, but how are you confirming the quality of the data that is transformed? How are you ensuring that the key metrics and data you’re exposing to downstream users are trustworthy and reliable?
At this stage, data testing that meets your organization’s standards should be conducted. This may look like:
- Testing primary key for uniqueness and non-nullness
- Ensuring column values fall in an expected range
- Checking that model relations line-up
- And more.
Using a product like dbt, where you can define code-based tests to run against your data transformations, you can create a system where transformations are easily and regularly tested against the standards you establish.
Step 5: Data documentation
Your data is transformed into meaningful business entities, tested to your standards, and is exposed to end users. How do you expect folks who were not directly involved in the transformation process to navigate through them? What documentation are you creating to clarify business logic in transformations and describe core metrics and columns?
At this stage, transformation is done, but in a way, the work is just getting started. To make your data transformations meaningful and impactful to end users, it’s important to create and maintain robust documentation.
We recommend you start by documenting the following during your transformation process:
- The primary purpose of a transformation or data model—what is the main reason this transformation was created? What important reports is it powering in your BI tool?
- Key columns that have business logic implemented in them or ambiguous column names
- Aggregates and metrics
Documentation for data transformation processes should often be written in a way that is understandable to the people who were not involved in the original process, instead of focusing on the how of a transformation for more technical users. Well-written transformation documentation welcomes business users into the fold of analytics work, and is a foundational way to ensure your end users feels comfortable and empowered to use the data your team works so hard to build.
Step 6: Schedule and automate
Your data transformations are created, tested, and documented—now it’s time to get them out into the world. At this stage, data engineers, data analysts, or analytics engineers will push these transformations to production—the process of running them in a production environment in the data warehouse. These production-ready tables are what analysts will query against for analysis and what a BI tool will read off of.
These data transformations will need to be refreshed, or updated, on a cadence that meets your business needs using some type of scheduler or orchestrator. Using a product like dbt Cloud, you can create your data transformations and tests within a collaborative integrated development environment (IDE) and then run them on a regular cadence using the dbt Cloud job scheduler. Other solutions that often require more technical lift include relying on custom cron schedulers or external orchestrators.
Rise and repeat!
After you build your foundational data transformations, your focus will likely shift to optimizing, governing, and democratizing the work; for each new data source, business problem, or entity needed, there will always be more data transformation work to get done. As a result, a good data transformation process is one that is simultaneously rigid and flexible—the process allows for enough guardrails to make analytics work worthwhile and organized, with enough room to be interesting, challenging, and customized to your business.
Last modified on: Nov 29, 2023