Building an automated data ecosystem: A guide for data engineers using dbt and GoodData from Coalesce 2023

Ryan Dolley, VP of Product Strategy, and Jan Soubusta, Distinguished Software Engineer, both at GoodData, explain how to build an automated data pipeline. They also share their experiences helping data engineers build an internal project.

Simplified database configuration for data engineers

Jan presents a system designed to streamline data extraction, transformation, and loading for data engineers. Onboarding a new dataset from an API requires changing only a single line of configuration, which keeps the process manageable even for non-experts. He states, "Extracting and loading the data... I want to change one line, nothing more. I just need to specify a new dataset which should be downloaded from GitHub API workflow runs, and I want to crawl all columns."
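
The talk doesn't walk through the code, but the pattern is easy to sketch. Below is a minimal, hypothetical Python illustration of one-line dataset onboarding; GoodData's actual pipeline uses Meltano (discussed below), and the endpoint map, connection string, and repository name here are placeholders, not their implementation:

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Onboarding a new dataset is a one-line change: add its name here.
# "workflow_runs" is the dataset Jan mentions in the talk.
DATASETS = ["commits", "workflow_runs"]

# Hypothetical mapping of dataset names to GitHub REST API endpoints.
ENDPOINTS = {
    "commits": "https://api.github.com/repos/{repo}/commits",
    "workflow_runs": "https://api.github.com/repos/{repo}/actions/runs",
}

# Placeholder warehouse connection.
engine = create_engine("postgresql://user:pass@localhost:5432/warehouse")

def extract_and_load(dataset: str, repo: str) -> None:
    """Download one dataset from the GitHub API and load all of its columns."""
    response = requests.get(ENDPOINTS[dataset].format(repo=repo), timeout=30)
    response.raise_for_status()
    payload = response.json()
    # Some endpoints nest rows under a key named after the dataset
    # (e.g., actions/runs returns {"total_count": ..., "workflow_runs": [...]}).
    rows = payload.get(dataset, payload) if isinstance(payload, dict) else payload
    # json_normalize "crawls all columns": no per-dataset schema is needed.
    pd.json_normalize(rows).to_sql(dataset, engine, if_exists="replace", index=False)

for dataset in DATASETS:
    extract_and_load(dataset, "your-org/your-repo")
```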

Importance of self-service in data engineering

Jan highlights the significance of self-service in data engineering, both for engineers and for business users. The system was designed to make data engineers self-sufficient, with the end goal of delivering a product that is also self-service for business end users. He explains, "I created all analytical objects in our UI... and then I executed one command, and I synchronized the state of the environment to the disk to these declarative definitions to YAML files."
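
The "one command" Jan describes maps onto GoodData's declarative API. Here is a minimal sketch using the GoodData Python SDK; the host and token are placeholders, and the exact method names may vary between SDK versions:

```python
from pathlib import Path

from gooddata_sdk import GoodDataSdk

# Placeholders: point these at your own GoodData instance.
sdk = GoodDataSdk.create("https://example.gooddata.com", "<api-token>")

# Snapshot the analytical objects built in the UI (logical data model,
# metrics, visualizations, dashboards) to declarative YAML files on disk,
# ready to commit to version control.
layout_dir = Path("gooddata_layouts")
sdk.catalog_workspace.store_declarative_workspaces(layout_root_path=layout_dir)

# The same definitions can later be pushed to another environment, e.g.:
# sdk.catalog_workspace.load_and_put_declarative_workspaces(layout_root_path=layout_dir)
```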

Challenges of unified configuration and stability

Despite the system's capabilities, Jan acknowledges two key challenges: the lack of a unified configuration across the entire data stack, and the stability of connectors. On the first, he says, "Nothing is more painful for me personally..." On the second, he points to stability issues with Meltano, a component of the system.

The potential of machine learning in data engineering

Looking to the future, Jan suggests that large language models (LLMs) could be applied in data engineering, specifically to automate the transformation of CI/CD pipelines. He also argues that generating a large portion of source code with fine-tuned LLMs should be feasible, saving significant development time. "LLM must be here... I want to use it to transform GitLab pipelines to GitHub pipelines. I don't want to do it manually... it should be definitely feasible nowadays."
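
Jan doesn't show an implementation, but the idea is straightforward to prototype. Below is a minimal sketch using the OpenAI Python client as a stand-in for whatever (possibly fine-tuned) model would actually be used; the model name and prompt wording are assumptions:

```python
from pathlib import Path

from openai import OpenAI  # stand-in: any chat-capable LLM client works similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

gitlab_ci = Path(".gitlab-ci.yml").read_text()

prompt = (
    "Translate this GitLab CI pipeline into an equivalent GitHub Actions "
    "workflow. Preserve job names, dependencies, and environment variables. "
    "Return only the YAML.\n\n" + gitlab_ci
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: a fine-tuned model would be named here instead
    messages=[{"role": "user", "content": prompt}],
)

Path(".github/workflows").mkdir(parents=True, exist_ok=True)
Path(".github/workflows/ci.yml").write_text(response.choices[0].message.content)
# The generated workflow should still get a human review before merging.
```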

The success of the system and open-source approach

Jan reports that the system was implemented successfully, noting how data engineers built comprehensive integrations with multiple platforms in a matter of weeks. He also encourages contributions to open-source projects to improve the existing tools. "Most importantly, did [the data engineers] succeed...? Obviously, the answer is yes... In a few weeks, they built integration with HubSpot, Salesforce, AWS S3, Google Sheets, and so on... everything is open source."

Jan’s key insights

  • Jan emphasizes the importance of following engineering best practices and keeping the system simple so that data engineers can work effectively
  • He highlights the need for a unified configuration for the whole end-to-end data stack, as different tools have different configurations
  • Jan suggests using fine-tuned, large language models to automate tasks like transforming GitLab pipelines into GitHub pipelines