The Future of Analytics is Polyglot

Caitlin is Co-founder and CTO at Hex.

Originally presented on 2021-12-08

Data tools today perpetuate siloing and fragmentation of workflows, with bright lines between languages and abstractions. BI tools top out quickly, SQL tools don’t allow more complex code, and Python/R workflows are often terrible at data queries.

We believe the future of analytics workflows is polyglot and will embrace better interoperability between languages and UIs.

Follow along in the slides here.

Browse this talk’s Slack archives

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript

Elize Papineau: [00:00:00] Hello, everyone. Welcome to Day 3 of Coalesce. My name is Elize Papineau. I am a senior analytics engineer at dbt Labs, and I’ll be the host for this session. The title of this session is The Future of Analytics Is Polyglot, and we’ll be joined by Caitlin Colgrove, who is the CTO at Hex. Hex is a polyglot data workspace for teams.

Before we get started, all of our chat conversation for this session is going to take place in the #coalesce-polyglot-analytics channel of dbt Slack. If you’re not a part of the chat, you have time to join. Visit community.getdbt.com and search for the coalesce-polyglot-analytics channel.

When you enter this space, we encourage you to ask other attendees questions, make comments and react in the channel. We’re all here experiencing this together. So let’s interact together in the channel. After the session, the speaker [00:01:00] will be available in the Slack channel to answer any of those questions that you have. Let’s get started. Over to you, Caitlin.

Caitlin Colgrove: Thanks, Elize. Hi everyone. I’m Caitlin. I’m the CTO over at Hex. For those who haven’t heard of us, we are a data workspace for teams. And as part of my role at Hex, I get to spend a fair amount of time thinking about how we can or should focus on incorporating various different technologies into the products. That’s the T part of the CTO.

And today I want to talk about one specific area of that, which is the use of programming languages in analytics. This is definitely an area where I personally am still learning new things every day. So rather than talk your ear off for the whole 30 minutes, I’m going to give my brief take on where I see things today, where it’s going, and some interesting technological trends that we’re seeing. And then I’d love to transition that [00:02:00] discussion over to Slack and hear from you all what you think I got wrong, whether to talk or argue, and just hear more about where you think the space is going.

[00:02:09] A Brief History of Analytics Languages

Caitlin Colgrove: But to get started, I think it can be helpful to take a little bit of a walk down memory lane and look at the history of code in analytics. So in the beginning there was SQL. It’s almost funny with this big resurgence in SQL. I know we’re all here at Coalesce, we’re all big SQL lovers, and it’s sometimes easy to forget that SQL is actually a very old language.

It was developed at IBM in 1974 for use with the equally new technology of the day, which was relational databases. And SQL was this huge revelation. It was a big step up from all sorts of previous query technology. The fundamental innovation was that you could specify which data you wanted as distinct from how the data was actually retrieved on the server, [00:03:00] which meant that a lot of the complexity around querying data could be hidden away at the database level instead of requiring the programmer or the analyst to reason about it.
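The separation of "which data" from "how it is retrieved" can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `orders` table, its rows, and the index name are hypothetical, and SQLite stands in for whatever database is actually in use. The query text never changes, but the engine's retrieval plan does once an index exists.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ada", 10.0), ("bob", 25.0), ("ada", 5.0), ("cy", 40.0)],
)

# The query states WHAT we want, not HOW to fetch it.
QUERY = "SELECT SUM(amount) FROM orders WHERE customer = ?"

def answer_and_plan(connection):
    """Return the query result and the plan the engine chose for it."""
    result = connection.execute(QUERY, ("ada",)).fetchall()
    plan = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("ada",)).fetchall()
    # The last column of each plan row is a human-readable description.
    return result, [row[-1] for row in plan]

result_before, plan_before = answer_and_plan(conn)

# Change HOW the data can be retrieved; the query itself is untouched.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

result_after, plan_after = answer_and_plan(conn)
```

Both runs return the same answer. Only the plan differs: a full table scan before, an index lookup after. The analyst never had to rewrite the query to benefit.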

All right. So let’s keep going, moving into the eighties and the nineties. We see the rise of a brand new technology, the first breakout example being Lotus 1-2-3, which I don’t know how many of us here are old enough to remember. But I am sure, on the other hand, that we are all familiar, whether we like it or not, with its far more popular and long-lasting competitor, which is in fact Excel.

[00:03:32] 1990s: Spreadsheets

Caitlin Colgrove: So spreadsheets aren’t often grouped with other programming languages. In a lot of cases they’re seen as less sophisticated, but especially towards the upper end, they’re really quite powerful. So for example, in Excel, you can do things like VBA, which is in itself a full-fledged programming language.

Spreadsheets were great. They were a lot more approachable than SQL at the time because you didn’t have to learn a whole new language to get started. And with [00:04:00] newly ubiquitous operating systems like Windows, spreadsheets were extremely portable between computers, which made collaboration very easy.

So as a result, spreadsheets were the dominant technology in business analytics through the eighties, nineties, into the 2000s. And you could even argue that despite all of the innovation in this space, despite the thousands of people here at this conference today talking about SQL and various other languages, they’re still the most dominant analytics technology today, especially in particular disciplines such as finance.

[00:04:31] 2010s: Python & R

Caitlin Colgrove: That’s some pretty incredible staying power for any technology, quite frankly. Looking at about 10 to 15 years ago, we have this revolution in data science. Workflows suddenly get a lot more complicated. We’re talking about ingesting and cleaning data, training models, doing things at scale for the first time really, and this just required a much more powerful set of tools than the spreadsheet could provide. So this timeframe is where we see the [00:05:00] rise of full-on programming languages, the big ones being Python and R, but there’s a bunch of other entries here. MATLAB is another classic example. And these became really common, sort of ubiquitous, in data analysis.

Not only are these languages much more flexible and powerful than the types of things that you can do in a spreadsheet, but they also have a very rich library ecosystem, which is a new development. And a lot of these libraries were designed specifically for data science and visualization. You have pandas, you have ggplot, and they’re all usable with a single import statement. So every time you move to a new project or team or whatever, you’re no longer reinventing the wheel. Using a programming language also has another great advantage, which is that it allows you to adopt more software engineering practices to help manage the increased complexity of the projects that came with these higher-end data science workflows.

[00:05:55] ~2017: SQL Again?

Caitlin Colgrove: All right. About five years ago, we see this [00:06:00] huge resurgence in SQL, and this has been driven by a lot of tools like dbt that allow you to take a fairly small amount of code, comparatively, and use it to power these incredibly complicated and powerful workflows. It is worth calling out that this is not your grandmother’s SQL.

We’ve learned a lot in the past 50 years, and today’s SQL tools come supercharged with a lot of powerful new features. One example that you can see here is Jinja templating. But they still retain a lot of the benefits that the early developers envisioned. The fact that SQL is a bit more on rails than Python or R, that it doesn’t give you complete control over the implementation, is actually an enormous advantage, because it makes the language really approachable to get started with and, on the backend, much more portable and scalable. So you combine that with the engineering practices forged in Python and R and you have what seems like a pretty compelling [00:07:00] set of workloads here.
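To make the templating idea concrete, here is a minimal sketch of templated SQL. It deliberately uses Python's standard-library `string.Template` as a stand-in, not Jinja and not dbt's actual rendering. The template, table names, and `render` helper are hypothetical, chosen only to show one SQL definition being reused across environments.

```python
from string import Template

# Hypothetical template. dbt uses Jinja; string.Template is a simpler
# stand-in that shows the same "fill in the blanks, get plain SQL" idea.
REVENUE_BY_DAY = Template(
    "SELECT order_date, SUM(amount) AS revenue\n"
    "FROM $table\n"
    "WHERE order_date >= '$start_date'\n"
    "GROUP BY order_date"
)

def render(table: str, start_date: str) -> str:
    """Render the template into plain SQL text.

    Plain substitution is fine for trusted, developer-supplied values,
    but it is NOT safe for untrusted input (SQL injection).
    """
    return REVENUE_BY_DAY.substitute(table=table, start_date=start_date)

# One definition, two environments (both table names are made up).
dev_sql = render("dev.orders", "2021-01-01")
prod_sql = render("analytics.orders", "2021-01-01")
```

The rendered strings are ordinary SQL, so the database never needs to know a template was involved.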

All right. So fast forward to this. Unfortunately, all of these paradigms, spreadsheets, Python, SQL, have their own devotees. Teams are often divided internally between multiple languages. You have your SQL groups, you have your data scientists working in R and Python, and you have your people over in finance using their spreadsheets.

Workflows are often spread across multiple languages, and often this requires using completely different, completely incompatible tools. Integration across these workloads is really challenging, as is collaboration. Overall, it’s generally a mess. So how do we deal with this? One proposed solution, obviously, would be: why don’t we just use [insert my language of choice] for everything?

This is appealing for a lot of reasons. You’d only have to learn one language. You’d only have to use one piece of technology. But unfortunately, in practice, when you try to do this, things get [00:08:00] really ugly. There’s a reason that all of these languages are still around. It’s because they’re good at different things.

So, a quick SQL versus Python example. Fun fact: it’s actually provable mathematically that every program that’s implementable in SQL is implementable in Python and vice versa. In computer science, this is called Turing completeness, named for Alan Turing. Even though that’s true mathematically, in practice the amount of effort that this requires can be wildly different.

For example, training a model in SQL. This is a really simple example, this is just a linear regression, but you can imagine TensorFlow, all of these really complicated modeling libraries. It’s theoretically possible, but in practice it can be extremely difficult, bordering on impossible, and really just generally not how you want to be spending your time.
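A small sketch of the contrast, under stated assumptions: the four data points and the `points` table are made up, SQLite (via Python's built-in `sqlite3`) stands in for a warehouse, and the SQL shown is one possible closed-form least-squares fit, not the query from the talk's slides.

```python
import sqlite3

# Hypothetical data lying exactly on y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)", points)

# Simple linear regression in SQL: closed-form least squares built
# from aggregate sums. Workable, but every formula is spelled out by hand.
SQL_FIT = """
WITH s AS (
    SELECT COUNT(*) AS n, SUM(x) AS sx, SUM(y) AS sy,
           SUM(x * y) AS sxy, SUM(x * x) AS sxx
    FROM points
),
fit AS (
    SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope, n, sx, sy
    FROM s
)
SELECT slope, (sy - slope * sx) / n AS intercept FROM fit
"""
sql_slope, sql_intercept = conn.execute(SQL_FIT).fetchone()

def fit_line(pts):
    """The same least-squares fit in plain Python."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    variance = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

py_slope, py_intercept = fit_line(points)
```

Both versions recover the same line. The SQL version is tolerable for a single predictor with a closed-form solution. Anything iterative, such as gradient descent or a neural network, is where the gap in effort grows quickly.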

And maybe someday someone’s going to invent a language that elegantly solves every programming problem. By the way, if you’re working on such a language, let me know. I’d love to be your beta tester, [00:09:00] but I think until I see something like that, it’s best to just assume that this polyglot environment will be true for the foreseeable future and that some languages will inherently be better at certain things than others.

And that if you try to wedge everything into a single language, it’s actually going to wind up being more complicated than if you just use the right language at each step. Okay. So what is a better solution? For this, we take a cue from the internet and ask ourselves, ¿Por qué no los dos? And indeed, why not both? For me, having worked with many of these paradigms and observed these different trends over the years, I’m very convinced that the only way forward is to embrace the fact that the world of analytics is inherently polyglot.

[00:09:47] Defining Polyglot

Caitlin Colgrove: And looking at the way things are moving, I’m convinced that the next era of data will also be polyglot. In this context, by polyglot we’re really just talking about using multiple different languages [00:10:00] in the same workflow. And this is definitely easier said than done. I don’t really think we’ve seen the full potential of what polyglot technologies can be.

And in the last few years there’s been so much movement and development in this space. There’s a lot of really exciting stuff going on here that’s going to make operating in a polyglot manner much more tractable than it was even five years ago. So, a couple of technological highlights. The first one is this technology called Apache Arrow. It’s being developed by Wes McKinney, who also incidentally created this other library, it’s obscure, you might’ve heard of it. It’s called pandas. But Arrow’s goal is to be a cross-language development platform for in-memory data.

Okay. So what does this mean and why is this good? First, you can think of Arrow as a universal data frame format that can be read by any language. Previously, a lot of workflows were reduced to passing CSVs around, which is extremely limited and [00:11:00] very slow. Arrow optimizes the heck out of this, which makes it an excellent choice for implementing polyglot systems.

And it’s in fact what we use for our backend interchange format at Hex. But Arrow also takes this one step further. For most languages, like Python and R, objects have different layouts in memory, which means that typically, to move things like data frames between languages, you actually have to completely rewrite the underlying data structure, which is a process that scales with the size of the underlying data. Arrow completely eliminates this.

And in certain cases it’s actually even more highly optimized, which makes for transfers that are hundreds of times faster than otherwise.
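The difference between rewriting a data structure and sharing it can be sketched with nothing but the Python standard library. To be clear about the assumption, this is an analogy and not Arrow's API: `array` and `memoryview` stand in for a shared columnar buffer, and the `column` values are made up.

```python
from array import array

# A contiguous buffer of 64-bit floats, standing in for one column.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Copying conversion: every value is rewritten into a new structure.
# The cost grows with the size of the data.
copied = list(column)

# Zero-copy view: a second "reader" over the very same bytes.
# No values are rewritten, so the cost does not depend on data size.
view = memoryview(column)

# Proof that the view shares memory and the copy does not:
column[0] = 99.0
shared_value = view[0]     # sees the update
copied_value = copied[0]   # still holds the old value
```

Arrow applies the same principle across languages by standardizing the in-memory layout, so a Python process and an R process can agree on what the bytes mean without converting them.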

Arrow, at its core, is a data frame, but it also has one additional, very cool feature, which is the DataFusion project. The DataFusion project allows you to query Arrow data frames with SQL as if they were [00:12:00] a database. In a similar vein, DuckDB is a project that bills itself as, no pun intended, an extremely fast in-memory database, which it absolutely is.

But it also has this killer feature: you can feed it a pandas or Arrow data frame and, without any additional overhead, using technology like the zero-copy format that we just talked about for Arrow, you can then query the data frame with SQL with no performance penalty for the translation.

Obviously this is huge. I’m sure everyone here has had a moment where we were like, "Oh, this is so hard in SQL, I wish I could just use pandas," or "This is so hard in pandas, I wish I could just use SQL." Now you can. It takes a bit of implementation, but over the summer, actually, at Hex we shipped a feature that we called Dataframe SQL, which does exactly this.

And as you might imagine, it’s been by far one of our most popular features. One other trend that I’m finding really fascinating is at the data warehouse layer. The Snowflake [00:13:00] versus Databricks conflict is really heating up. I don’t know who’s been following this, with the benchmarks that were released a few weeks ago.

If you haven’t, you should grab some popcorn and check it out. But the reason this is actually really interesting is that these companies did not start as competitors. They actually started with completely different stances on language support. Snowflake was one of the OG pioneers of being all in on SQL.

On the other end, Databricks started out focusing on much higher-end, flexible, multipurpose data science workflows. But look what’s happened in the last couple of years. Snowflake now has support for Java and Python UDFs as well as working directly with data frames. And on the other side, Databricks has Delta Lake, which is basically just a SQL-driven data warehouse.

So this might be a problem for both of the companies competing with each other, but honestly, it’s great for all of us, because what this means is that increasingly the most common analytics platforms are going to have support for multiple languages out of the [00:14:00] box, making polyglot workflows much more accessible to everyday analysts.

The last piece of this that is actually pretty interesting and active right now is the data workspace layer. There’s a lot of innovation going on here, and I actually don’t think we know where this is going to end up yet. That’s fun for me personally to watch, but here are just some examples that exist today.

On the notebook front, you have things like Netflix’s Polynote and Two Sigma’s BeakerX. Interestingly, these are both polyglot notebooks that use the JVM, the Java virtual machine. They’re using Java under the hood to power polyglot notebook environments. You also have workspaces, and there are a lot of really interesting products here.

There are some established players like Mode, which has Python and SQL, as well as newer tools that are really innovating here. One example is Sigma, which combines BI and spreadsheets in a really interesting way. Obviously, Hex is also one of these, but I’m not really going to bore you with the Hex commercial here. If you’re interested, just hit me up offline and we can have a [00:15:00] chat about it.

[00:15:06] Looking Forward

Caitlin Colgrove: So that’s where we’re at today. But I think one of the big advantages of being polyglot by design that doesn’t get talked about is that you can really easily incorporate new languages as they’re being developed. And there’s a lot of really interesting stuff out there, both stuff that exists today and, I’m sure, stuff that’s going to be developed in the future.

There are general purpose languages like Julia, and there are domain-specific ones like Malloy, which was just released by Looker for describing data transformations. Another interesting entry is JavaScript, which is a very old language but is increasingly becoming popular for data visualization. And who knows what else is going to be designed in the future?

At the start of this talk, we saw over 50 years, we had multiple different languages, multiple different paradigms, and often these were adopted for years or decades at a time. And I think it’s highly likely that over the next 50 years [00:16:00] or so, we’ll see a similar pattern of cycles of innovation and adoption.

And so we want to be ready for that. And in my opinion, the best way we can be ready for that future is by adopting a polyglot, flexible tool chain today, right now. And that’s the end of my talking part. Sadly, as much as we’d like it to be the case, I just don’t think one language is going to be able to elegantly tackle all of the use cases in analytics.

And I think that’s totally okay. There’s a lot of new technology coming down the line that’s going to dramatically improve our ability to operate in this increasingly polyglot world. Yeah. So at this point, I’d love to have you join me over in Slack. We have some discussion questions. It looks like we have a lively discussion going on already.

And yeah, just let me know where you think I’m right or wrong or what’s going on in this space. And if you’re not in Slack, feel free to hit me up at caitlin@hex.tech. Thanks everyone for coming.
