Cloud data warehouses

To understand why cloud analytics warehouses were such a step function improvement, let’s examine what it was like before:

OLTP databases #

If you were setting up an analytics database in the year 2005, it was probably a regular old SQL database (MySQL, SQL Server, PostgreSQL, etc.).

These databases are transactional databases built to store and retrieve application data.

What they do well #

Quickly CRUD (create / read / update / delete) individual rows of data.
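As a minimal sketch of that single-row CRUD workload, here's the lifecycle of one row using Python's stdlib `sqlite3` as a stand-in for a transactional database (the table and values are made up for illustration):

```python
import sqlite3

# In-memory SQLite stands in for a transactional DB (MySQL, Postgres, etc.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Create a row
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")
# Read it back
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
# Update it in place
conn.execute("UPDATE users SET email = 'ada@new.example.com' WHERE id = 1")
# Delete it
conn.execute("DELETE FROM users WHERE id = 1")
conn.close()
```

Operations like these, touching one row at a time by primary key, are exactly what OLTP databases are optimized for.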

Not so much #

Perform aggregate queries or window functions on large datasets. As an analytics database, they’re sloooooow.

This is because transactional databases are row-based, meaning they operate on each row in its entirety as an object.

This is what makes them so performant at updating individual rows. But for analytics, where a query may only need a few columns from each row, reading every row in full results in a lot of wasted compute.
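A toy sketch of that waste (not a real storage engine — just a list of dicts standing in for a row store, with made-up order data):

```python
# Row store: the table is a list of whole records.
rows = [
    {"id": 1, "amount": 10.0, "country": "US", "notes": "first order"},
    {"id": 2, "amount": 20.0, "country": "DE", "notes": "repeat buyer"},
    {"id": 3, "amount": 5.0,  "country": "US", "notes": "promo code"},
]

# SELECT SUM(amount): every full row gets loaded just to read one field.
total = sum(row["amount"] for row in rows)

# 4 fields x 3 rows = 12 fields touched to aggregate 3 values.
fields_touched = sum(len(row) for row in rows)
```

Three values needed, twelve fields read — and real tables have far more than four columns.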

Scaling literally #

To make matters worse, this database was probably stored on-premise, either in a closet in your company’s offices, or at your company’s data center.

If you needed more storage or compute power, your IT team had to literally upgrade your hardware.

Storage and compute were expensive.

These were indeed dark times for analytics teams — a huge premium was placed on transforming data before it made it to your database, to avoid paying for unneeded storage and bogging down queries.

Enter Cloud Data Warehouses #

AWS Redshift launched in late 2012, and was the first MPP (massively parallel processing) cloud data warehouse that I know of.

Since Redshift launched, many others have launched competing platforms (BigQuery, Databricks, Snowflake, et al.), each with its own strengths.

Flipping the script, from rows to columns #

Rather than querying data row-by-row, MPP warehouses are generally columnar, meaning they store + query individual columns separately.

This makes querying large (aka tall) datasets quite performant, since a query scans only the columns it needs and wastes nothing on extraneous ones.
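The same toy aggregate from a columnar point of view (again a sketch, not a real engine — each column stored as its own array):

```python
# Column store: each column lives in its own array.
columns = {
    "id": [1, 2, 3],
    "amount": [10.0, 20.0, 5.0],
    "country": ["US", "DE", "US"],
    "notes": ["first order", "repeat buyer", "promo code"],
}

# SELECT SUM(amount): only the "amount" array is scanned --
# 3 values, no matter how many other columns the table has.
total = sum(columns["amount"])
values_scanned = len(columns["amount"])
```

Real columnar engines add compression and vectorized execution on top of this layout, but the core win is the same: untouched columns cost nothing.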

Scaling in code #

These warehouses also run on AWS, Google, Microsoft, or other cloud providers' data centers. This means your compute + storage can be scaled up easily with code + configuration, rather than by physically plugging in new boxes.
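As a hedged sketch of what "scaling with code" looks like: resizing a warehouse boils down to changing a parameter in an API call. The cluster name and node counts below are made up; the dict mirrors the kind of parameters you'd hand to a cloud SDK (e.g. boto3's Redshift `modify_cluster` call), which isn't actually invoked here:

```python
# Hypothetical example: scaling compute by changing a config value.
# "analytics-prod" and the node count are made-up values.
def resize_request(cluster_id: str, nodes: int) -> dict:
    """Build the parameters for a warehouse resize API call."""
    return {"ClusterIdentifier": cluster_id, "NumberOfNodes": nodes}

# Going from 2 nodes to 4 is a one-line change, not a hardware order.
params = resize_request("analytics-prod", nodes=4)
# e.g. boto3.client("redshift").modify_cluster(**params)  # not executed here
```

Compare that to 2005: a purchase order, a shipping delay, and an afternoon in the server closet.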

These warehouses are essentially supercomputers on which we can run SQL queries.

What this unlocked #

Most importantly, massive + scalable storage became affordable to almost any business. This meant that teams could, for the first time, store raw data without (too much) concern for storage costs.

As you’ll see in the transformations section, this opened up a whole new data transformation lifestyle for analytics teams.