How Amazon S3 works

Daniel Poppy

on Jul 21, 2025

This post first appeared in The Analytics Engineering Roundup.

In this season of the Analytics Engineering podcast, Tristan is deep into the world of developer tools and databases. If you're following us here, you've almost certainly used Amazon S3 or its blob storage siblings at Microsoft and Google. They form the foundation for nearly all data work in the cloud. In many ways, the innovations that happened inside S3 unlocked much of the progress in cloud data over the last decade.

In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks. They close out talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward.

Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.


Key takeaways

Operating systems, garage sales, and Xen

Tristan Handy: You’ve done a lot over the last 20 years. Before we get into specifics, can you just share a little about your journey as a software engineer?

Andy Warfield: I just like playing with computers. I studied computer science in Ontario for undergrad, then moved to Vancouver for grad school, then to the UK for a PhD. I worked on operating systems, low-level stuff. I got to work on a hypervisor called Xen, which ended up being used by a lot of cloud providers, including Amazon.

After that, I did a couple of startups, one around Xen. Then I became a professor at UBC, teaching operating systems, networking, and security. Later, I did another startup in storage, and eventually I joined Amazon.

Now I have this highfalutin role—VP and senior principal engineer—working across S3, other storage services, and now a bunch of analytics services too. I get to cause trouble in lots of different parts of the cloud.

Tristan: VP slash distinguished engineer—does that mean you just get to march around telling people how to improve their stuff?

Andy: People love that! I’d say about half the time I’m causing trouble—starting things and encouraging new ideas—and the other half I’m helping teams dig out from those ideas. Sometimes I take over a team if we’re doing something especially interesting or innovative, just so I can be closer to the action.

Tristan: That sounds like a pretty good gig if you can get it.

Andy: It’s amazing. I’ve been here nearly eight years, and I still love this job.

The rise of virtualization and the origin of Xen

Tristan: I want to talk about Xen. You said you were always interested in operating systems, which is kind of a niche fascination. What drew you in?

Andy: When I was a kid, we didn’t have much money, so I built computers from garage sale parts in Ottawa. In high school, I found a federal government warehouse that sold off old equipment. I started a little business buying pallets of hardware for cheap, fixing them up, and reselling them.

It was chaotic—but I learned a lot. I dealt with machines like IBM DisplayWriters with 8-inch floppy disks and massive dot-matrix printers. Getting them working meant diving into their software and systems.

Eventually I played with Linux, hacked on the kernel, and that all led me into OS research and development.

Tristan: So what is a hypervisor, and why did virtualization become so important in the 2000s?

Andy: There were two big drivers: server utilization and isolation.

Companies had racks full of 1U servers, most of which sat idle most of the time. But they couldn’t share workloads because apps weren’t isolated well—config conflicts, shared resources, etc.

Virtualization allowed multiple operating systems to run on the same hardware, with isolation. It also let you consolidate servers, which had big cost and efficiency benefits.

There was also a technical challenge: x86 processors weren’t designed to be virtualized. That made it a really interesting research problem. We wanted to see if it could even be done—and done efficiently.

Tristan: And Intel eventually started building virtualization support into the hardware?

Andy: Exactly. Our work on Xen and similar projects showed it was possible. That pushed Intel and AMD to add features like VT-x, which made it easier and more performant to run hypervisors.

Tristan: How did AWS end up using Xen?

Andy: I wasn’t part of those internal conversations, but the story goes that a small startup in Cape Town, South Africa, was building a control plane for Xen. That team got picked up by AWS and became the basis for EC2.

Understanding Amazon S3

Tristan: Let’s switch to S3. I think a common mental model is that S3 is just a big pool of SSDs. But that’s clearly not the whole story. How do you explain what S3 actually is?

Andy: That’s one of my favorite questions.

Early on, S3 was like a storage locker. You’d rent space to stash things you didn’t need right away—backups, static files, CDN origins. Latency wasn’t great, but durability and availability were.

Things really changed when the Hadoop community built S3A—an adapter to let Hadoop use S3 instead of HDFS. Suddenly, we had people doing real analytics on S3. The system had enough drives to support massive parallel reads.
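To make the S3A shift concrete, here is a minimal sketch of what it looks like from a Spark job: the DataFrame API is unchanged from HDFS, only the URI scheme differs. The bucket and path are hypothetical placeholders, and the sketch assumes the hadoop-aws connector is on the classpath.

```python
# Minimal PySpark sketch: reading from S3 through the s3a:// connector
# instead of HDFS. Bucket and path are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-read-example")
    # Assumes the S3A connector (hadoop-aws) is available; credentials
    # are typically resolved from the environment or an instance profile.
    .getOrCreate()
)

# Same DataFrame API as with HDFS -- only the URI scheme changes.
df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical bucket
print(df.count())
```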

Today, workloads are way more demanding. Performance, consistency, and latency matter. We’ve been evolving the system constantly to meet those needs.

Tristan: Are we talking about billions of hard drives?

Andy: I can’t share exact numbers, but yes—it's a lot of hard drives. Some of our largest customers have data spread across millions of drives. And most drives are shared across multiple customers.

Tristan: And these aren’t SSDs?

Andy: Mostly spinning disks, actually. Hard drives are terrible at latency, but they’re cheap and good for bursty workloads. Spreading your data across many disks lets you take advantage of parallelism.
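One illustrative way to see that parallelism from the client side: instead of reading a large object as one stream, issue many concurrent byte-range GETs, each of which can be served from different disks and servers. This is a sketch under that assumption; the bucket, key, and part size are hypothetical.

```python
# Illustrative sketch: fetching one large S3 object as concurrent
# byte-range GETs to exploit the parallelism of many underlying drives.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big/object.bin"  # hypothetical names
PART = 8 * 1024 * 1024  # 8 MiB per ranged request

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch(offset: int) -> bytes:
    # Each ranged GET can land on different disks/servers, so requests
    # proceed in parallel instead of queueing behind one spindle.
    end = min(offset + PART, size) - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}")
    return resp["Body"].read()

with ThreadPoolExecutor(max_workers=16) as pool:
    # pool.map preserves order, so the parts concatenate correctly.
    data = b"".join(pool.map(fetch, range(0, size, PART)))
```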

S3’s durability, performance, and scale

Tristan: Let’s talk about S3’s durability promise: eleven nines (99.999999999%). How do you achieve that?

Andy: We use erasure coding—a form of RAID-like redundancy that splits data into data shards plus parity shards. Then we store those shards across different availability zones.

We constantly monitor for failures. Disks die all the time, so we have fleets of processes repairing and maintaining durability. It’s not static. It’s a living system.
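As a toy illustration of the erasure-coding idea—emphatically not S3’s actual scheme, which uses more sophisticated codes and shard placement—here is the simplest possible version: split data into k shards plus one XOR parity shard, so any single lost shard can be reconstructed from the survivors. Combined with continuous repair, this kind of redundancy is what backs durability figures like eleven nines.

```python
# Toy single-parity erasure coding: k data shards + 1 XOR parity shard.
# Losing any one shard is recoverable from the rest. S3's real scheme is
# far more sophisticated; this only demonstrates the core idea.

def encode(data: bytes, k: int):
    shard_len = -(-len(data) // k)  # ceiling division
    shards = [
        data[i * shard_len:(i + 1) * shard_len].ljust(shard_len, b"\x00")
        for i in range(k)
    ]
    parity = bytearray(shard_len)
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return shards, bytes(parity)

def recover(shards, parity: bytes, lost: int) -> bytes:
    # XOR of the parity with all surviving shards rebuilds the lost one.
    out = bytearray(parity)
    for j, shard in enumerate(shards):
        if j == lost:
            continue
        for i, b in enumerate(shard):
            out[i] ^= b
    return bytes(out)

shards, parity = encode(b"hello, durable world", k=4)
assert recover(shards, parity, lost=2) == shards[2]
```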

Tristan: You must have incredibly precise failure models.

Andy: We do. We track failure rates, temperature sensitivity, vendor behavior—everything. That allows us to be proactive and surgical in how we manage risk.

From Parquet to Iceberg to S3 table buckets

Tristan: I want to talk about table formats. Parquet is everywhere now. And then we got Hive Metastore, then Iceberg. Why did S3 launch table buckets?

Andy: Parquet is great, but it’s just files. Customers kept asking for more structured semantics: schema evolution, upserts, ACID transactions.

We saw Iceberg adoption grow rapidly—especially among our largest analytics customers. But they were struggling with operational complexity: too many small files, custom compactors, brittle catalogs.

So we launched S3 table buckets to bring native Iceberg support to S3. That includes:

  • Automatic compaction
  • A REST catalog
  • High-performance access

We wanted to make it easier to treat Iceberg as a storage primitive, not just an analytics backend.
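For a sense of what that looks like in practice, here is a minimal sketch of creating a table bucket with boto3. It assumes the "s3tables" client available in recent boto3 releases; the bucket name is hypothetical and response field names may vary by SDK version.

```python
# Minimal sketch: creating an S3 table bucket with boto3. Assumes the
# "s3tables" client in recent boto3 releases; names are hypothetical.
import boto3

s3tables = boto3.client("s3tables")

# A table bucket is the container for Iceberg tables that S3 manages
# (compaction, catalog integration) on your behalf.
bucket = s3tables.create_table_bucket(name="analytics-tables")
print(bucket.get("arn"))  # response shape may differ by SDK version
```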

Tristan: So this is a shift in philosophy—S3 isn’t just object storage, it’s now table-aware?

Andy: Exactly. Historically, S3 was just where you stored objects. Now, we’re thinking more about what those objects mean.

We also launched S3 object metadata tables—a way to semantically describe and query your object store, especially useful for AI workloads using retrieval-augmented generation (RAG).
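Because those metadata tables are exposed as queryable tables, you can interrogate them with a SQL engine such as Athena. The sketch below assumes that setup; the database name, table name, column names, and output location are all hypothetical placeholders that would come from your own S3 Metadata configuration.

```python
# Hedged sketch: querying an S3 metadata table through Athena via boto3.
# Database/table/column names and the output location are hypothetical.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString=(
        "SELECT key, size, last_modified_date "
        'FROM "my_metadata_db"."my_bucket_metadata" '
        "ORDER BY last_modified_date DESC LIMIT 10"
    ),
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print(resp["QueryExecutionId"])
```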

The future of open data and S3

Tristan: What does the future of S3 look like? Where’s this going?

Andy: We’re headed toward more structure, more semantics, and more performance.

Inference workloads are scaling fast. AI models are hitting S3 hundreds of thousands of times per second to do vector lookups. That’s changing how we think about indexing, metadata, and latency.

We want to make S3 the best place to do open, flexible, high-scale data work—from tables to training data to retrieval.

Chapters

[01:42] Meet Andy Warfield

Andy shares his background, including startups, professorship, and his current role as VP & Senior Principal Engineer at AWS.

[05:10] From garage sales to hypervisors

Andy describes his early passion for hardware, OS development, and the origin story behind the Xen hypervisor.

[08:50] Why virtualization took off in the 2000s

Exploring why isolation, utilization, and technical curiosity fueled the rise of hypervisors.

[14:30] Xen vs. VMware and the road to AWS

How Xen became the default for EC2 and the technical differences between virtualization approaches.

[17:35] The origin of EC2 and S3

How a team from Cape Town helped launch AWS compute—and the early days of cloud services.

[20:00] What is S3, really?

Andy breaks down the mental model behind S3: not just object storage, but a scalable data platform.

[22:49] How many drives? More than you think

Why S3 storage spans millions of drives—and how AWS uses scale to deliver performance.

[28:10] The 11 nines durability model

Inside S3’s approach to reliability, failure tolerance, and background repairs using erasure coding.

[32:00] Tail latency and engineering for bursty workloads

Why slow requests matter, and how S3 teams optimize for streaming, AI, and analytics use cases.

[35:20] Iceberg, metadata, and table buckets

The emergence of Apache Iceberg as a table format—and AWS’s new structured storage approach.

[38:00] Why S3 added a REST catalog and compaction

How AWS is simplifying the operational burden of working with Iceberg at scale.

[40:00] A new mental model for object storage

S3 is no longer just about storing files—it’s about managing semantics, lineage, and trust.

[44:00] Looking ahead: S3, RAG, and semantic metadata

How S3 is preparing for the next wave of AI, inference, and context-aware applications.

[47:20] Is Iceberg ready for enterprise?

Andy shares thoughts on enterprise readiness, performance tradeoffs, and real-world adoption of table formats.

[49:05] Wrap-up and reflections

Tristan and Andy reflect on the conversation and where data infrastructure is headed next.

