It’s time for open source analytics

When a software engineer starts on a new project, her first step is almost always to survey the ecosystem of open source software that exists in the problem space.

There are now tens of millions of open source projects, and this foundation of shared code gives developers leverage. Rather than starting at bare metal and building every layer of the stack, developers select from standard open source components and start building new applications from this foundation.

But data analysts haven’t benefitted from this same trend. SQL, the primary language that analysts use to interact with data, isn’t one of the top 50 languages on GitHub.

A software developer shows up to every new challenge armed with a huge array of free tools in her belt. An analyst shows up with a blank text editor.

As an analyst myself, that sucks. Fortunately, I think the time has come for open source analytics. In this post I’ll argue that, while there are historically good reasons for the status quo, things are changing. I’ll also describe what open source analytics might look like and share the tools that we’re building at Fishtown Analytics to get there.

Open source analysis, not analytic tools

Before diving in, a quick note. We have had open source analytics tools for some time. R and Python and their numerous packages are great examples, and there are recent projects like Re:Dash that provide more accessible interfaces to analysts as well.

In this article I’m referring to open sourcing the analysis itself, not analytics tools. While analysts have plenty of open source ways to draw a graph and calculate a p-value, we have to figure out how to calculate revenue, or visitors, or inventory, over and over and over.

Why open source analytics? Why now?

There are several fundamental changes in the modern analytics landscape that unlock the open source model for analysts.

Analytics is more important than ever

The confluence of several trends means that data is becoming increasingly strategic:

  • Data volume is growing exponentially, fed from web, mobile, and IoT. There are simply more insights to be had.
  • Improvements in data processing technology have expanded the frontier of what is possible.

As a result, businesses have begun to see the analytics function as a way to create a competitive advantage. This has driven demand for data professionals, who now command high salaries and are pulled from the ranks of elite universities and graduate schools.

This is the type of environment that creates the opportunity for disruption.

Schemas are becoming standardized

Historically, business software was highly customized. Consultants were paid huge sums of money to build or customize enterprise applications for the Fortune 500, and analysts needed to learn the ins and outs of the schema of each custom application.

Developing this very company-specific expertise was a double-edged sword for these analysts: it represented real job security, but it was also isolating. Because each schema was different, it meant that analytic code for that schema was different. This prevented meaningful collaboration.

Today, the shift of business software to SaaS is fundamentally changing this status quo. For the first time, analysts have the opportunity to write code that is relevant for their peers. Every one of the 10 million Mailchimp customers can use the same code to analyze their email marketing performance.

The importance of this shift cannot be overstated. The open source model is only relevant if groups can work together to solve shared problems. While no two analyses will ever look exactly the same, standardized schemas allow analysts to work together to solve common problems.
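To make this concrete, here is a minimal sketch of one query serving two unrelated companies because they share a schema. The `sends` table and its columns are hypothetical stand-ins, not Mailchimp’s actual schema, and SQLite is used only so the example runs anywhere.

```python
import sqlite3

# Shared analytic code: one query written once against a standardized schema.
# The `sends` table and its columns are hypothetical, for illustration only.
OPEN_RATE_SQL = """
    select
        campaign_id,
        count(*) as sends,
        avg(opened) as open_rate
    from sends
    group by campaign_id
    order by campaign_id
"""


def load_company(rows):
    """Create an in-memory database for one company using the shared schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sends (campaign_id integer, opened integer)")
    conn.executemany("insert into sends values (?, ?)", rows)
    return conn


# Two unrelated companies: different data, identical schema.
company_a = load_company([(1, 1), (1, 0), (2, 1), (2, 1)])
company_b = load_company([(1, 0), (1, 0), (1, 1), (1, 1)])

# The same query runs unchanged against both.
result_a = company_a.execute(OPEN_RATE_SQL).fetchall()
result_b = company_b.execute(OPEN_RATE_SQL).fetchall()
print(result_a)
print(result_b)
```

Neither company had to write or adapt the query. With a custom schema per company, each analyst would have had to write their own version.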

Analytics is being done in code

In order for knowledge to be produced via an open source production model, it needs to be managed via a distributed source control process.

Source control allows knowledge to be treated like an asset, with large communities able to track, control, and distribute modifications. Source control is what allows a thousand developers to work together on a single code base.

In the past, most analysts have worked in Excel. Excel is a tremendously useful interface, but it doesn’t submit well to source control. The Excel file format is verbose and mixes data, code, and formatting. Because of this, it is all but impossible for a source control process to merge changes from multiple versions of an Excel file together into a single master copy.

But data tech is advancing, and Excel is no longer the center of the universe for modern analysts. Today, analysts use tools like Stitch, Fivetran, Redshift, Snowflake, BigQuery, Looker, and Mode Analytics to construct their data infrastructures. SQL lives at the heart of this stack.

The layers on top of SQL are often in code, too: Python and R have become the dominant tools for sophisticated analysis. These languages provide functionality for both the data processing and visualization components of advanced analytics.

Analysts today work increasingly in code, which submits to source control processes cleanly.
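As a rough illustration of why plain-text code works well with source control, the sketch below compares two versions of a query line by line, which is essentially what a source control tool does when it tracks changes. The query, the `charges` table, and the file name are hypothetical, and Python’s standard `difflib` stands in for a real tool like git.

```python
import difflib

# Two versions of the same (hypothetical) revenue query, as an analyst
# might commit them to source control.
version_1 = """select
    date(created_at) as day,
    sum(amount) as revenue
from charges
group by 1
"""

version_2 = """select
    date(created_at) as day,
    sum(amount) as revenue
from charges
where refunded = 0
group by 1
"""

# Produce a line-based diff in the unified format that git also uses.
diff = list(
    difflib.unified_diff(
        version_1.splitlines(),
        version_2.splitlines(),
        fromfile="revenue.sql (before)",
        tofile="revenue.sql (after)",
        lineterm="",
    )
)
print("\n".join(diff))

# Collect the changed lines, skipping the "+++" / "---" file headers.
added = [l for l in diff if l.startswith("+") and not l.startswith("+++")]
removed = [l for l in diff if l.startswith("-") and not l.startswith("---")]
```

The change is isolated to the single line that was added, which is what makes review and merging tractable. An Excel file, which mixes data, code, and formatting, offers no comparably clean unit of change.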

Tools in the open source ecosystem are more accessible

While I can find my way around a command line, I am by no means a software engineer. Many analysts I know are in a similar boat.

Because the open source model grew out of software development, the tools and processes it uses are tuned for this user community. Historically, tools like git, svn, vim, and emacs have been both important and fairly inaccessible for non-developers. Two things are changing this:

  • Tools have become friendlier. GitHub and Bitbucket both have excellent front ends for non-technical users, and GUI-based text editors have gotten quite good.
  • Online courses abound. Short classes aimed at less technical users allow analysts to become proficient in interfaces like bash, git, and vim without having to parse pages of help files.

Analysts now have access to the means of open-source production.

Open source analytics principles

I strongly believe that the development of the open source analytics ecosystem will mirror that of open source software in two specific ways:

Decentralized control

There are very few decision-making bodies within open source software. Software is written by self-organizing groups who have absolute control over the software they write, and it’s installed and used by people who have absolute control over how their software operates. Useful software gets traction, useless software doesn’t.

Do one thing well

The core design ethos for open source software began with UNIX design principles first codified in 1978. The very first rule was “Make each program do one thing well.” Because of the relative ease with which software can be integrated when its source is available, open source software is often composed of many components with multiple authors working together.

Introducing dbt package management

Analysts need tools to enable the collaborative production process I have described. The first one of those is a package manager: a tool that allows analysts to easily package and distribute analytic code for others to use.

Today, we are announcing the launch of package management functionality within dbt, our command line data build tool.

With this release, analysts can create packages of data models and publish them in public git repos. Other analysts can include those repositories within their own projects via their git URL and dbt will download the source and include it in subsequent compile and run phases.
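The idea of building on someone else’s models can be sketched in a few lines. This is a conceptual toy, not dbt’s implementation: here a “package” is just a Python dictionary of model names and SQL, whereas dbt fetches packages from git repositories. The model names and the `charges` table are hypothetical.

```python
import sqlite3

# A "package" here is just a mapping of model name -> SQL (hypothetical;
# dbt itself downloads package source from a git repository).
stripe_package = {
    "stripe_daily_revenue": """
        select day, sum(amount) as revenue
        from charges
        group by day
    """,
}

# The analyst's own project builds on top of the package's model.
my_project = {
    "best_day": """
        select day, revenue
        from stripe_daily_revenue
        order by revenue desc
        limit 1
    """,
}


def run(conn, *model_sets):
    """Create each model as a view, in the order the sets are given."""
    for models in model_sets:
        for name, sql in models.items():
            conn.execute(f"create view {name} as {sql}")


conn = sqlite3.connect(":memory:")
conn.execute("create table charges (day text, amount integer)")
conn.executemany(
    "insert into charges values (?, ?)",
    [("2016-01-01", 10), ("2016-01-01", 15), ("2016-01-02", 40)],
)

# Package models are created first, then the project models that use them.
run(conn, stripe_package, my_project)
best_day = conn.execute("select * from best_day").fetchone()
print(best_day)
```

The analyst’s `best_day` model never has to define how daily revenue is calculated; it relies on the shared model for that.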

The process facilitated by dbt’s package management functionality embodies both principles I outlined earlier: decentralized control (anyone can create packages with no central gatekeeper) and do one thing well (packages can build on top of one another).

This release lays the groundwork for analysts to collaboratively develop data models on top of common schemas, dramatically accelerating their time-to-insight. We’ve released three packages that are ready for you to use: Stripe, Mailchimp, and Zendesk. Of course, there’s plenty more work that could be done on each, so we encourage you to try them out and contribute back.

To get started with dbt, view the readme. I’d love to hear your thoughts in the comments below.

It’s time for analysts to start working together.

Last modified on: Apr 25, 2022