Apache Iceberg: The Open Table Format Reshaping the Data Lakehouse
Data Engineering · Open Table Formats · 2025
Apache Iceberg: The Open Table Format Quietly Winning the Data Wars
Born at Netflix to solve petabyte-scale chaos, Apache Iceberg has become the industry's de facto standard for the modern data lakehouse — and for good reason.
Netflix Had a Problem. A Petabyte-Scale Problem.
It was 2017, and Netflix's data engineers were fighting a war on two fronts: the relentless growth of their streaming data and the painful limitations of Apache Hive, the then-dominant table format. Querying petabytes of Parquet files on Amazon S3 was like trying to find a specific grain of sand on a beach — without a map. File listings were slow, schema changes were dangerous, and concurrent writes were a gamble.
Ryan Blue and Daniel Weeks, engineers at Netflix, decided to build something different. They designed Apache Iceberg from first principles — not as a patch on Hive, but as a complete reimagining of what a table format should be in a cloud-native, distributed world. The result, open-sourced in 2018 and donated to the Apache Software Foundation in 2019, has since grown into the most consequential architectural choice in modern data engineering.
"We're 100% committed to Apache Iceberg. Customers want thriving open ecosystems, and they don't want to be locked in."
— Christian Kleinerman, EVP of Product, Snowflake · Data Cloud Summit 2024

The timing of Iceberg's rise is no accident. The data lakehouse architecture — the idea of combining the cheap, scalable storage of a data lake with the transactional and analytical power of a data warehouse — demands exactly what Iceberg provides: a reliable, engine-agnostic, open metadata layer sitting on top of Parquet files in object storage.
What Is Apache Iceberg, Really?
At its core, Apache Iceberg is an open table format specification. Not a storage engine. Not a query engine. A specification that precisely defines how metadata, manifest files, data files, and catalogs interact — such that any engine implementing the spec can read and write Iceberg tables correctly, without depending on a specific runtime library.
This distinction matters enormously. The open spec is why Spark, Flink, Trino, Snowflake, DuckDB, BigQuery, and dozens of other tools can all interoperate on the exact same Iceberg table without conflict, coordination, or copying data.
The three-layer architecture is the key to everything Iceberg does. The catalog layer stores a pointer to the current metadata file. The metadata layer contains snapshots, partition specs, and schema history — along with manifest files that track every data file along with per-column statistics (min, max, null counts). The data layer is just Parquet files in your object storage — no proprietary format, no lock-in.
When a query runs, Iceberg performs three-level pruning: first eliminating irrelevant partitions, then pruning manifest files, then pruning individual data files using column statistics — all before opening a single file. This is how Iceberg delivers sub-second query planning on tables with billions of rows.
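As a rough illustration, the manifest-level and file-level pruning steps can be sketched in a few lines of Python. The data structures below are simplified stand-ins for Iceberg's real metadata tree — this is not the pyiceberg API, and the partition-pruning level is elided:

```python
# Conceptual sketch of Iceberg's stats-based pruning. Each manifest carries
# aggregate column stats; each data file carries its own per-column min/max.
# Whole manifests, then individual files, are skipped before any file opens.

def prune(manifests, column, lo, hi):
    """Return only data files whose [min, max] range for `column`
    can overlap the predicate lo <= column <= hi."""
    files = []
    for manifest in manifests:
        # Level 2: skip an entire manifest if its stats rule it out.
        m_min, m_max = manifest["stats"][column]
        if m_max < lo or m_min > hi:
            continue
        for f in manifest["files"]:
            # Level 3: skip individual data files using per-file stats.
            f_min, f_max = f["stats"][column]
            if f_max < lo or f_min > hi:
                continue
            files.append(f["path"])
    return files

manifests = [
    {"stats": {"ts": (0, 99)},
     "files": [{"path": "a.parquet", "stats": {"ts": (0, 49)}},
               {"path": "b.parquet", "stats": {"ts": (50, 99)}}]},
    {"stats": {"ts": (100, 199)},  # pruned wholesale for a ts <= 60 query
     "files": [{"path": "c.parquet", "stats": {"ts": (100, 199)}}]},
]

print(prune(manifests, "ts", 0, 60))  # ['a.parquet', 'b.parquet']
```

The point of the sketch: planning cost scales with metadata, not with the number of files in object storage, which is why no S3 listing is ever needed.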
The Swamp Before Iceberg: Why Hive-Partitioned Tables Failed
To understand why Iceberg matters, you need to feel the pain it replaces. Hive-partitioned tables on S3 were the de facto standard for years — and they were brutal to maintain at scale.
- Partition columns leak into schema, breaking downstream queries
- S3 file listing for metadata = O(n files) — catastrophically slow at scale
- No ACID transactions: concurrent writes corrupt tables silently
- Schema changes (column renames, type changes) are destructive operations
- No time-travel: yesterday's data is gone once overwritten
- Modifying a partition requires rewriting the entire partition
- Engine-specific: same table can't be safely shared between Spark, Presto, Hive
- DELETE and UPDATE operations are either impossible or require full rewrites

Iceberg was designed to answer each of these failures directly:
- Hidden partitioning: users query on real columns, transforms applied automatically
- Metadata stored in manifest files — no S3 listing ever required
- Full ACID snapshot isolation: reads never blocked by concurrent writes
- Schema evolution: add, rename, drop, reorder columns safely via metadata
- Time travel: query any historical snapshot with AS OF TIMESTAMP
- Partition evolution: change partition strategy as a pure metadata operation
- Open spec: any engine reads the same table identically, simultaneously
- Row-level deletes and updates via position and equality delete files
What Iceberg Does That No Other Format Can
All three major open table formats — Iceberg, Delta Lake, and Apache Hudi — solve ACID transactions, schema evolution, and time travel. The real differentiation lies in architectural choices that only become visible when your tables grow to petabyte scale, your team changes over time, or you need to mix query engines.
With hidden partitioning, users query on event_time, not event_time_day. Iceberg applies partition transforms (bucket, truncate, year/month/day/hour) transparently. No more broken queries from stale partition column knowledge. No more "why is my query scanning everything?" support tickets.
Iceberg vs Delta Lake vs Hudi: The Honest Comparison
The table format debate of 2025 has matured beyond tribalism. All three formats are production-grade. The right choice depends on your workload, your engine ecosystem, and your team's operational preferences. Here is an honest feature comparison grounded in the latest research.
| Feature / Dimension | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Open specification | Full open spec | Coupled to Databricks catalog | Partial |
| Partition evolution | Metadata only — unique | Liquid Clustering alternative | Not supported |
| Multi-engine support | Broadest ecosystem | Strong Spark / Databricks | Spark + Flink primary |
| Streaming / CDC workloads | Good | Good | Best — designed for this |
| Metadata scalability | Excellent (manifest tree) | Good (log replay) | Good |
| Hidden partitioning | Native | Not available | Not available |
| Time travel / snapshots | Full | Full | Full |
| ACID transactions | Full | Full | Full |
| Schema evolution | Comprehensive | Comprehensive | Good |
| Vendor lock-in risk | Minimal — open catalog | Moderate — Unity Catalog | Low |
| Fortune 500 installed base | Growing fast — #1 new adoption | Largest — 60%+ via Databricks | Niche — streaming specialists |
| Catalog options | Hive · Glue · Polaris · Nessie · REST | Unity · Hive · Glue | Hive · custom |
The clearest signal of Iceberg's momentum: Databricks itself — Delta Lake's creator — acquired Tabular (the company founded by Iceberg's original Netflix creators) for over $1 billion in mid-2024, and now offers both Delta and managed Iceberg tables. Even Delta's home platform hedged its bets on the format that has become the interoperability lingua franca.
Snowflake + Apache Iceberg: The Integration That Changes Everything
Of all the Iceberg integrations in the ecosystem, Snowflake's is arguably the most consequential. Snowflake made Iceberg tables generally available in June 2024, and then went further — open-sourcing the Polaris Catalog (now Apache Polaris), effectively donating a vendor-neutral Iceberg REST catalog to the entire industry. This was not a small gesture.
How the Snowflake–Iceberg Integration Works
Snowflake offers two modes for working with Iceberg tables, catering to different architectural philosophies:
1. Snowflake-Managed Iceberg Tables: Snowflake writes and manages the Iceberg metadata. Data files are stored in your external cloud storage (S3, GCS, Azure). You get Snowflake's full query performance, governance (column masking, row access policies, object tagging via Horizon), and data sharing — all applied natively to Iceberg tables as if they were native Snowflake objects.
2. External Iceberg Tables (Read from External Catalog): Point Snowflake at an Iceberg table managed by an external catalog (AWS Glue, Apache Polaris, or Hive). Snowflake queries the table without moving data. Ideal for organizations where Spark or Flink is the primary write engine and Snowflake is the analytics layer. No ETL required — just a catalog integration object and a metadata refresh.
3. Snowpipe Streaming Directly to Iceberg: Snowflake's Snowpipe Streaming SDK now writes data directly to Iceberg table format in real time. Organizations can stream Kafka topics, IoT events, or CDC streams into Iceberg tables at sub-minute latency — then query them immediately in Snowflake or any other Iceberg-compatible engine.
4. Open Catalog (Managed Apache Polaris): Snowflake hosts Apache Polaris as a managed service — the Snowflake Open Catalog. It implements the Iceberg REST Catalog API, providing centralized governance for Iceberg tables across any compatible engine. Role-based access, principal roles, catalog-level storage scoping, and validation are all built in.
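The Iceberg REST Catalog API that Polaris implements is what lets any engine find and load the same tables. A sketch of its endpoint layout — the host and prefix below are made-up placeholders, and only the request URLs are built, no network call is made:

```python
# Sketch of the Iceberg REST Catalog endpoint layout, as implemented by
# Apache Polaris / Snowflake Open Catalog. Hypothetical host and prefix.

BASE = "https://polaris.example.com/api/catalog"  # placeholder host

def config_url():
    # Engines call this first to discover catalog capabilities.
    return f"{BASE}/v1/config"

def table_url(prefix, namespace_levels, table):
    # Multi-level namespaces are joined with the %1F unit separator
    # per the Iceberg REST spec; loading this URL returns the table's
    # current metadata, from which the engine plans its scan.
    ns = "%1F".join(namespace_levels)
    return f"{BASE}/v1/{prefix}/namespaces/{ns}/tables/{table}"

print(table_url("my_catalog", ["analytics", "prod"], "events"))
# https://polaris.example.com/api/catalog/v1/my_catalog/namespaces/analytics%1Fprod/tables/events
```

Because every engine resolves tables through the same small HTTP surface, the catalog — not any one vendor's runtime — becomes the point of coordination.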
The Momentum Is Undeniable
By every measure available in 2025, Apache Iceberg has achieved escape velocity. The pattern across the industry is remarkably consistent: Iceberg has become the interoperability lingua franca. Even platforms that don't use Iceberg natively now expose their data as Iceberg for cross-engine access.
When Should You Choose Iceberg?
Despite its broad advantages, Iceberg is not the answer to every question. Here is an honest assessment of where each format excels:
Choose Apache Iceberg when:
- You need multi-engine interoperability (Spark writes, Snowflake queries, Flink streams)
- Avoiding vendor lock-in is a top priority
- Your tables will grow to petabyte scale with frequent schema or partition changes
- Your cloud bill is driven by data scanned (you need maximum file pruning)
- You're building on Snowflake, Trino, Dremio, or BigQuery as primary query engines

Choose Delta Lake when:
- Your stack is deeply Databricks-centric and you value tight Spark integration
- You already have a large Delta installed base that works well
- Batch processing performance on TPC-DS benchmarks is a primary concern

Choose Apache Hudi when:
- Your primary workload is high-frequency CDC (Change Data Capture) from databases
- You need record-level upserts at massive scale (the Uber, Robinhood, Walmart use case)
- Merge-on-Read with async background compaction is your preferred write pattern
Iceberg Isn't a Table Format. It's a New Data Contract.
Apache Iceberg's success cannot be explained by a single killer feature. It is the accumulation of correct architectural decisions made years ago — an open specification instead of an implementation, hidden partitioning instead of user-visible partition columns, a manifest tree instead of a flat transaction log — that compound into a system uniquely suited to the fragmented, multi-engine, cloud-native reality of modern data infrastructure.
The $1 billion Tabular acquisition, AWS S3 Tables, Snowflake's full commitment, Confluent Tableflow, and Apache Polaris are not independent events. They are all acknowledgments of the same conclusion the industry has quietly reached: when data needs to flow freely between engines without copying, without lock-in, and without silent corruption, Apache Iceberg is the answer.
For data engineers building new lakehouses today, Iceberg is not a bet on a technology. It is a bet on the principle that open standards — properly designed, with governance given to the community — outcompete proprietary formats in the long run. History, in data infrastructure, has consistently favored the open standard. Apache Iceberg looks set to be the clearest example of this truth in a generation.
The Iceberg Is Just the Tip
Iceberg v3, geospatial support, streaming-native writes, and Apache Polaris as a production-grade catalog: the format that started as Netflix's internal tool is now the foundation of the next decade of data infrastructure.
Sources & Further Reading
- Onehouse. (Oct 2025). Apache Hudi vs Delta Lake vs Apache Iceberg: Lakehouse Feature Comparison. onehouse.ai
- Dremio. (2025). Comparison of Data Lake Table Formats: Apache Iceberg, Apache Hudi and Delta Lake. dremio.com
- Körükcü, Y.A. (Feb 2026). Apache Iceberg vs Delta Lake vs Hudi: The Real Differences Nobody Explains Simply. Medium.
- Xenoss. (Aug 2025). Apache Iceberg vs Delta Lake vs Hudi Comparison. xenoss.io
- Snowflake Engineering Blog. (2025). Apache Polaris: The End of Data Vendor Lock-In. snowflake.com
- TechTarget. (Apr 2025). Snowflake broadens open-source embrace, ups Iceberg support. techtarget.com
- Atlan. (Mar 2025). Apache Iceberg in Snowflake: A Practical Guide. atlan.com
- VentureBeat. (Jun 2024). Snowflake unveils Polaris, a vendor-neutral open catalog for Apache Iceberg. venturebeat.com
- LakeFS. (Mar 2025). Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared. lakefs.io
- Reintech. (Apr 2026). Apache Iceberg vs Delta Lake vs Apache Hudi 2026: Table Format Comparison. reintech.io