Data transformation is the process of changing data from one format into another. This can include data type casting, joins, aggregations, and column renaming. Data transformation often refers to the logic embedded into your data models. It is a key component of the ETL/ELT process where the “T” represents the data transformation stage. Transformation is typically performed by the analytics engineer on the team or, depending on the company, a data analyst or data engineer.
Data transformation is a key step in ensuring your data is usable by the business and will lead to actionable, reliable insights. It cleans your raw data into a format the data team can actually use. It also speeds up analysis efforts by automating basic data cleaning, ensuring your data can be queried at any point in time. Without it, data analysts would not only have to construct reports and dashboards, but would also need to standardize and validate their data before doing so. Transformation creates clear, concise datasets that don’t have to be questioned when used; the data team can act on them thanks to the standardization process these datasets have undergone.
Benefits of data transformation #
Data transformation is a necessary part of modern data analytics. But why is that? Well, there are two main reasons: to increase data quality and create more useful datasets.
Increases data quality
Data transformation increases data quality through the process of standardization and automation. Through transformation, you can cast your data to all be the same data type across multiple datasets. You can also convert timestamp values to a single timezone and currency amounts to a single unit. By standardizing your data, you leave little room for error. Data analysts don’t need to remember which dataset is in which timezone or currency. They know the data is high-quality because of the standardization that has taken place.
In addition to standardizing raw data sources, metrics can be properly calculated in different data models. Data transformation allows you to apply the same metric calculation across different models and datasets, ensuring they are consistent. This way, analysts do not need to reference formulas to use for different metrics. They can create dashboards and reports knowing the metrics they are using have already been standardized across the business.
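As a minimal sketch of this idea (the revenue definition and field names below are hypothetical illustrations, not taken from any real model), a metric calculation can be defined once and reused across every dataset that needs it:

```python
# Hypothetical sketch: define a metric once, reuse it across datasets.
# "Net revenue = gross amount minus refunds and discounts" is an assumed
# definition for illustration only.

def net_revenue(row):
    """Apply the single agreed-upon net revenue calculation to one row."""
    return row["gross_amount"] - row["refunds"] - row["discounts"]

# Two different datasets share the same field names (an assumption here).
orders = [
    {"gross_amount": 100.0, "refunds": 10.0, "discounts": 5.0},
    {"gross_amount": 50.0, "refunds": 0.0, "discounts": 2.5},
]
subscriptions = [
    {"gross_amount": 30.0, "refunds": 0.0, "discounts": 0.0},
]

# The same function is reused, so both totals follow one definition.
orders_revenue = sum(net_revenue(r) for r in orders)                # 132.5
subscriptions_revenue = sum(net_revenue(r) for r in subscriptions)  # 30.0
print(orders_revenue, subscriptions_revenue)
```

Because the calculation lives in one place, changing the business definition of the metric means changing one function rather than hunting through every dashboard.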
Creates reusable, complex datasets
Data transformation allows you to automate various data cleaning and metric calculations. This ensures a consistent, accurate data set is being generated in the data warehouse each day, or on whatever time cadence your business chooses. By automating certain data models, data analysts do not need to repeat the same calculations over and over again within the business intelligence layer. These data sets can be referenced directly within a report or dashboard instead, speeding up compute time.
Data transformation also enables the reverse ETL process. Transformation allows analytics engineers to join different datasets into one data model, providing all the needed data in one dataset. Because datasets are being automated using data transformation, this data can be ingested into different reverse ETL tools, giving stakeholders the data they need, when they need it.
Challenges of data transformation #
Data transformation might sound like a pretty straightforward process, but it’s really not. It can get extremely difficult depending on the complexity of your data, the number of sources you’re pulling from, and the needs of your stakeholders. Some of the biggest challenges you’ll face are: creating consistency, standardization of KPIs, and defining your data modeling conventions.
Consistency across multiple datasets
Being consistent across multiple datasets, especially when they all have different data sources, can prove challenging. During this process it is important to read documentation from the sources you ingested the raw data from. This will give you an understanding of the different timezones, data types, and units you are dealing with. Once you figure this out, you then need to determine consistent standards to use across all datasets. After defining a standard, you must apply it across these raw datasets.
For example, if data ingested from the Facebook API is all in your business-specific timezone, but you want to use UTC time across all raw data, you need to properly cast those timestamp values. However, data ingested from Mailchimp may already be in UTC time. You can’t assume all raw data sources use the same units. It is important that you take inventory of the metadata and treat each dataset differently to ensure it meets consistent standards.
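A minimal sketch of this normalization, using only the Python standard library (the specific timezones and timestamps are assumptions for illustration; real Facebook or Mailchimp payloads will differ):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+

# Hypothetical sketch: one source delivers naive timestamps in a
# business-local timezone (assumed America/New_York here), another
# already delivers UTC. Normalize both to UTC before modeling.

def to_utc(ts, source_tz=None):
    """Attach the source timezone to a naive timestamp, then convert to UTC."""
    if ts.tzinfo is None and source_tz is not None:
        ts = ts.replace(tzinfo=ZoneInfo(source_tz))
    return ts.astimezone(timezone.utc)

facebook_ts = datetime(2022, 8, 1, 9, 30)  # naive, assumed business-local
mailchimp_ts = datetime(2022, 8, 1, 13, 30, tzinfo=timezone.utc)  # already UTC

# Both sources land on the same UTC instant after transformation.
print(to_utc(facebook_ts, "America/New_York"))  # 2022-08-01 13:30:00+00:00
print(to_utc(mailchimp_ts))                     # 2022-08-01 13:30:00+00:00
```

The key design choice is to attach the *source* timezone before converting, rather than blindly relabeling every timestamp as UTC, which would silently shift the actual instant.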
Here are some questions to keep in mind when trying to create consistency in your data:
- What timezone are your dates and timestamps in?
- Are similar values the same data type?
- Are all numeric values rounded to the same number of decimal points?
- Are your columns named using the same format?
These are all different factors to consider when creating consistent datasets. Doing so in the transformation stages will ensure analysts are creating accurate dashboards and reports for stakeholders.
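The questions above can be turned into lightweight automated checks. Below is a minimal sketch (the snake_case naming convention and dict-based rows are assumptions for illustration, not a prescribed standard):

```python
import re

# Hypothetical convention: column names should be snake_case.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def check_consistency(rows):
    """Return a list of human-readable issues found in a list of dict rows."""
    issues = []
    columns = rows[0].keys()
    # Check that column names follow one naming format.
    for col in columns:
        if not SNAKE_CASE.match(col):
            issues.append(f"column '{col}' is not snake_case")
    # Check that each column holds a single data type across all rows.
    for col in columns:
        types = {type(row[col]).__name__ for row in rows}
        if len(types) > 1:
            issues.append(f"column '{col}' mixes types: {sorted(types)}")
    return issues

rows = [
    {"order_id": 1, "Amount": 10.5},
    {"order_id": "2", "Amount": 8.0},
]
print(check_consistency(rows))
# ["column 'Amount' is not snake_case", "column 'order_id' mixes types: ['int', 'str']"]
```

In practice checks like these usually live in a testing framework rather than ad hoc scripts, but the principle is the same: encode each consistency question as an assertion that runs every time the data is transformed.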
Defining data modeling conventions
Defining data modeling conventions is a must when utilizing data transformation within your business. One of the reasons data transformation is so powerful is because of its potential to create consistent, standardized data. However, if you have multiple analytics engineers or data analysts working on your data models, this can prove difficult. In order to create high quality, valuable datasets your data team must decide on style conventions to follow before the transformation process begins.
If proper style guidelines aren’t created, you may end up with various datasets and data models all following different standards. The goal is to ensure your data is consistent across all datasets, not just across one engineer’s code. We recommend creating a style guide before jumping into the code. This way you can write all of your standards for timezones, data types, column naming, and code comments ahead of time. This allows for discussion amongst the team, saving time and ensuring everyone agrees on the solution.
Standardization of core KPIs
While creating consistency across datasets is important, so is standardizing the definitions of KPIs. Without proper data transformation, it is easy for different parts of the business to define KPIs differently. One team could be using one calculation for revenue and another team could be using a slightly different one, creating disconnect between the business. Data transformation allows you to standardize these KPIs, but it requires lots of business discussion and agreement between teams.
The analytics engineer can’t decide which definition they want to use and just go with it. It involves multiple different stakeholders sharing their thoughts and coming to an agreement over which calculation makes sense for the business as a whole. Different teams have different understandings of what goes into certain KPIs. They also all use them differently, so you need to ensure the calculation you decide on benefits everyone in the way they plan to use it.
Data transformation tools #
Just like any other part of the modern data stack, there are different data transformation tools depending on different factors like budget, resources, and specific use cases. Here are some considerations to keep in mind when looking for a data transformation tool.
Enable engineering best practices
We always recommend choosing a tool that will help you and your team follow engineering best practices. This will allow your data team to scale with the tool while producing high quality data for your stakeholders. You should consider whether or not a data transformation tool offers version control, or connects directly with GitHub, so that you can always keep track of code changes. It’s also a best practice to document your transformations as you write them, so look for a tool that supports that.
Build vs buy
Sometimes, depending on the size of your company and the resources it has available, it is best to build a data transformation tool internally rather than buying one. This may be appropriate if you have a very specific use case that no external tool supports. When considering building your own tool, it’s important to look at your budget and the resources you have available. Is it cheaper to pay for an external tool or to hire data engineers to build and maintain one in-house? Do you have enough engineers to dedicate the time to building this tool?
Open source vs SaaS
If you decide to use an external data transformation tool, you then need to decide whether you want to use an open source tool or a SaaS provider. Open source is budget-friendly yet will require more maintenance when something goes wrong. SaaS tools have a whole support team dedicated to helping you set up the tool and integrate it into your already-existing stack. Whether you choose open source or SaaS will again depend on your specific budget and the resources available to you. Do you have the time to integrate an open source tool? Is your stack too complex to make it work? Do you have the budget to pay for a SaaS provider? Do you want to depend on someone else to debug errors in your system?
Technical ramp period
Last but not least is the technical learning curve that comes with implementing a data transformation tool. You need to ask yourself if those on your data team have the technical expertise to use whatever tool you choose to implement. Can they code in the language the tool uses? If not, how long will it take them to learn? If it’s an open source tool you decide on, you may need to consider whether or not your data team is familiar with hosting that tool on a provider like AWS. Keep in mind the learning curve that may exist and whether or not your team is willing to take that on.
The backstory: how modern data transformation came to be #
Let’s take a step back to 2016, when dbt was born.
Raw source data was flowing into MPP databases with (relative) ease. At that time, teams were visualizing it in reporting tools like Looker, Tableau and many others. But the transformation layer in between was being duct taped together.
Teams were writing complex stored procedures with often-confusing dependencies. Or sometimes even manually refreshing queries to calculate metrics for visualizations.
Refreshing transformed metrics often required human intervention:
When a query failed, a human would re-run each individual query from that point forward in the dependency graph, inferring those dependencies by intuition or experience.
There wasn’t much of a concept of an automatically-generated dependency graph.
Turns out that many teams were running into this same challenge: plentiful raw data, a useful tool to visualize it, but nothing in between to get datasets ready for use.
dbt emerged to empower data teams to solve this problem.
Data transformation is a key part of the ETL/ELT process within the modern data stack. It changes your raw data from one form to another, allowing you to standardize data types, timezones, and metrics across the business. When you utilize data transformation, you are working to produce high quality data and reusable datasets that will help propel your business forward. While there are challenges, the benefits far outweigh the hurdles you will face. If you’re wondering if dbt is the right data transformation tool for you, read this article. For more on data transformation and the best practices to follow, check out dbt’s courses.
Last modified on: Aug 1, 2022