Under the hood of Apache Iceberg

on Aug 25, 2025
This post first appeared in The Analytics Engineering Roundup.
If you're a data practitioner, you likely understand Iceberg as a user, why it's important, and how it's changing the way that we build data systems. But you may not know a lot about what's going on beneath the surface.
There are multiple ways to interface with Iceberg catalogs and multiple versions of the Iceberg REST spec, and several leading catalogs implement that spec. All of this sits in an ecosystem that includes companies of all sizes, proprietary and open-source code, and academic and commercial contexts.
In a few years, all this ambiguity will be behind us, but right now things are very much evolving in real time. To get an update on the state of the Iceberg ecosystem and walk through these developments, Tristan talks with Christian Thiel, one of the lead architects of Lakekeeper, one of the most widely used Iceberg catalogs.
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Key takeaways
Walk us through your background
Christian Thiel: I started in natural language processing, then moved into machine learning applications in manufacturing. Like many people, I realized that the biggest barrier wasn’t the algorithms but the data—its availability, quality, and accessibility. That led me deeper into data architecture and engineering, eventually to building Lakekeeper.
What is Lakekeeper, and what are you building now?
Lakekeeper is an Iceberg catalog implementation—a technical requirement for building distributed, composable analytic systems based on Apache Iceberg. But our vision goes beyond that. We see the future in data collaboration and reliable sharing of data, supported by clear contracts.
For listeners new to Iceberg, what makes it so important?
Iceberg allows organizations to store data once, in an open format, and then use the compute engine best suited for each workload. It’s a foundation for building modern, composable data platforms while avoiding vendor lock-in. If there’s one thing that should be open, it’s the data at the center of your platform.
Some folks might say this sounds like Hadoop all over again—lots of open standards that are hard to integrate. Why is this time different?
The ecosystem has matured. Even big vendors like Snowflake and Databricks are embracing Iceberg, which shows there’s a strong shift toward openness. Plus, the tooling and infrastructure are much easier to deploy today. A modern Iceberg setup is far less complex than a Hadoop environment used to be.
Let’s talk about what’s happening under the hood. How does Iceberg work?
Iceberg organizes data using a metadata hierarchy. At the top is a JSON metadata file that stores high-level table information: snapshots, schemas, and locations. Each snapshot points to a manifest list, which in turn references the manifest files that track the individual data files. This hierarchy is what makes things like time travel, atomic transactions, and schema evolution possible.
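To make the hierarchy concrete, here is a minimal sketch (ours, not from the episode) that walks it with pyiceberg. The catalog URI, warehouse name, and the table `analytics.events` are placeholder assumptions; any Iceberg REST catalog, such as Lakekeeper, could sit behind that endpoint.

```python
# Minimal sketch with pyiceberg; the endpoint, warehouse, and table
# name are hypothetical placeholders for any Iceberg REST catalog.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "uri": "http://localhost:8181/catalog",  # assumed REST endpoint
        "warehouse": "demo-warehouse",
    },
)

table = catalog.load_table("analytics.events")

# Top of the hierarchy: the JSON metadata file the catalog points at.
print(table.metadata_location)

# The metadata file records schemas and snapshots...
print(table.schema())
for snapshot in table.snapshots():
    # ...and each snapshot points to a manifest list, which references
    # the manifest files that track the individual data files.
    print(snapshot.snapshot_id, snapshot.manifest_list)
```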
What about ongoing maintenance?
There are two key tasks. First, expiring old snapshots so you don’t accumulate unnecessary files. Second, compaction: combining many small files into larger ones so reads stay fast.
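As an illustration of both tasks (not from the episode), here is a sketch using Iceberg’s stock Spark procedures, `expire_snapshots` and `rewrite_data_files`. The catalog name `my_catalog`, the table `db.events`, and the cutoff timestamp are placeholders, and a SparkSession configured with the Iceberg extensions is assumed.

```python
# Sketch of routine Iceberg maintenance via Spark's built-in procedures.
# Assumes a SparkSession already configured with the Iceberg runtime
# and SQL extensions; names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# 1. Expire old snapshots so unreferenced files can be cleaned up.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2025-08-01 00:00:00'
    )
""")

# 2. Compaction: rewrite many small data files into fewer large ones
#    (536870912 bytes is ~512 MB).
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```

Most teams run both procedures on a schedule, since every streaming or micro-batch write tends to add small files and new snapshots.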
Catalogs are another critical piece. What role do they play?
Catalogs manage the top layer of metadata and coordinate transactions. They make atomic updates possible, allow multiple writers, and handle governance—things like access control and multi-table transactions.
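To see what coordinating transactions means in practice, here is a hedged sketch of a commit against the Iceberg REST spec: the writer sends its metadata updates together with requirements, and the catalog applies them atomically only if those requirements still hold. The base URL, prefix handling, and snapshot ID are illustrative, and the updates payload is elided.

```python
# Illustrative Iceberg REST catalog commit. The key idea: the writer
# states *requirements* (optimistic concurrency checks), and the
# catalog commits atomically only if they still hold, which is how
# multiple concurrent writers are kept safe.
import requests

BASE = "http://localhost:8181/catalog/v1"  # assumed endpoint, prefix omitted

# Discover catalog configuration (a real endpoint in the REST spec).
config = requests.get(f"{BASE}/config").json()

commit = {
    "requirements": [
        # Fail the commit if someone else advanced `main` since we read it.
        {"type": "assert-ref-snapshot-id", "ref": "main",
         "snapshot-id": 4358109269873137077},  # hypothetical ID
    ],
    "updates": [
        # The actual metadata changes would go here, e.g. adding a new
        # snapshot and pointing `main` at it (payload elided).
    ],
}
resp = requests.post(f"{BASE}/namespaces/analytics/tables/events",
                     json=commit)
resp.raise_for_status()  # the catalog applied the change atomically
```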
How enterprise-ready is Iceberg today?
Very ready. A year ago there were still gaps, but today Iceberg tables are close to performance and feature parity with native tables on platforms like Snowflake and BigQuery. Governance and authorization models are still evolving, and different catalogs implement them differently, but the core functionality is there.
Speaking of catalogs, how should someone pick between options like Lakekeeper, Polaris, Unity, AWS Glue, or Gravitino?
Christian Thiel: It depends on priorities. Lakekeeper focuses on performance, extensibility, and ease of use. Polaris is developer-focused but less user-friendly. Unity is tightly integrated into Databricks. Glue now supports the Iceberg REST spec, which makes it more interoperable than before. Gravitino is another option aimed at enterprise-scale environments.
Recently, DuckDB announced DuckLake. What’s your take on that?
It’s interesting, but there are two concerns. First, it uses a database schema directly for the catalog, which creates interoperability issues—similar to the early JDBC catalog in Iceberg that the community eventually moved away from. Second, it was built without community involvement, and openness without adoption isn’t really openness.
That said, for heavy DuckDB users, it could offer optimizations that make queries extremely fast, and if the broader ecosystem adopts it, it could become a viable open format.
What’s next for Lakekeeper?
We’re continuing to invest in table optimization, enterprise features, and data collaboration tools. Our vision is what we call the “unbreakable lakehouse,” where contracts and collaboration guardrails make shared data more reliable. Long-term, we see Lakekeeper as enabling truly collaborative, open data ecosystems.
Chapters
- 00:00 – Introduction: Tristan Handy introduces the episode and the focus on Apache Iceberg.
- 01:40 – Christian Thiel’s background: From natural language processing to data engineering.
- 04:30 – Introduction to Lakekeeper: What Lakekeeper is and its role in the Iceberg ecosystem.
- 06:00 – Why Iceberg matters: How open table formats enable flexibility and reduce vendor lock-in.
- 11:40 – How Iceberg works under the hood: Metadata hierarchy, catalogs, and how state is managed.
- 21:30 – Maintenance and optimization: Snapshot expiration, compaction, and keeping tables performant.
- 24:20 – Catalogs and governance: Access control, multi-table transactions, and security.
- 31:40 – Enterprise readiness: How Iceberg is evolving for production use in large organizations.
- 42:10 – Choosing the right catalog: Overview of Lakekeeper, Polaris, Unity, Glue, and Gravitino.
- 47:20 – DuckLake discussion: Pros, cons, and ecosystem adoption challenges.
- 52:00 – The future of Lakekeeper: Data contracts, collaboration, and building the “unbreakable lakehouse.”