Apache Iceberg: The Open Table Format Reshaping the Data Lakehouse
Data Engineering · Open Table Formats · 2025
Apache Iceberg: The Open Table Format Quietly Winning the Data Wars
Born at Netflix to solve petabyte-scale chaos, Apache Iceberg has become the industry's de facto standard for the modern data lakehouse — and for good reason.
Netflix Had a Problem. A Petabyte-Scale Problem.
It was 2017, and Netflix's data engineers were fighting a war on two fronts: the relentless growth of their streaming data and the painful limitations of Apache Hive, the then-dominant table format. Querying petabytes of Parquet files on Amazon S3 was like trying to find a specific grain of sand on a beach — without a map. File listings were slow, schema changes were dangerous, and concurrent writes were a gamble.
Ryan Blue and Daniel Weeks, engineers at Netflix, decided to build something different. They designed Apache Iceberg from first principles — not as a patch on Hive, but as a complete reimagining of what a table format should be in a cloud-native, distributed world. The result, open-sourced in 2018 and donated to the Apache Software Foundation in 2019, has since grown into the most consequential architectural choice in modern data engineering.
"We're 100% committed to Apache Iceberg. Customers want thriving open ecosystems, and they don't want to be locked in."
— Christian Kleinerman, EVP of Product, Snowflake · Data Cloud Summit 2024

The timing of Iceberg's rise is no accident. The data lakehouse architecture — the idea of combining the cheap, scalable storage of a data lake with the transactional and analytical power of a data warehouse — demands exactly what Iceberg provides: a reliable, engine-agnostic, open metadata layer sitting on top of Parquet files in object storage.
What Is Apache Iceberg, Really?
At its core, Apache Iceberg is an open table format specification. Not a storage engine. Not a query engine. A specification that precisely defines how metadata, manifest files, data files, and catalogs interact — such that any engine implementing the spec can read and write Iceberg tables correctly, without depending on a specific runtime library.
This distinction matters enormously. The open spec is why Spark, Flink, Trino, Snowflake, DuckDB, BigQuery, and dozens of other tools can all interoperate on the exact same Iceberg table without conflict, coordination, or copying data.
The three-layer architecture is the key to everything Iceberg does. The catalog layer stores a pointer to the current metadata file. The metadata layer contains snapshots, partition specs, and schema history — along with manifest files that track every data file along with per-column statistics (min, max, null counts). The data layer is just Parquet files in your object storage — no proprietary format, no lock-in.
When a query runs, Iceberg performs three-level pruning: first eliminating irrelevant partitions, then pruning manifest files, then pruning individual data files using column statistics — all before opening a single file. This is how Iceberg delivers sub-second query planning on tables with billions of rows.
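As a rough illustration, the manifest-level and file-level pruning steps can be sketched in a few lines of Python. The data structures below are simplified stand-ins for Iceberg's real metadata tree — this is not the pyiceberg API, and the partition-pruning level is elided:

```python
# Conceptual sketch of Iceberg's stats-based pruning. Each manifest carries
# aggregate column stats; each data file carries its own per-column min/max.
# Whole manifests, then individual files, are skipped before any file opens.

def prune(manifests, column, lo, hi):
    """Return only data files whose [min, max] range for `column`
    can overlap the predicate lo <= column <= hi."""
    files = []
    for manifest in manifests:
        # Level 2: skip an entire manifest if its stats rule it out.
        m_min, m_max = manifest["stats"][column]
        if m_max < lo or m_min > hi:
            continue
        for f in manifest["files"]:
            # Level 3: skip individual data files using per-file stats.
            f_min, f_max = f["stats"][column]
            if f_max < lo or f_min > hi:
                continue
            files.append(f["path"])
    return files

manifests = [
    {"stats": {"ts": (0, 99)},
     "files": [{"path": "a.parquet", "stats": {"ts": (0, 49)}},
               {"path": "b.parquet", "stats": {"ts": (50, 99)}}]},
    {"stats": {"ts": (100, 199)},  # pruned wholesale for a ts <= 60 query
     "files": [{"path": "c.parquet", "stats": {"ts": (100, 199)}}]},
]

print(prune(manifests, "ts", 0, 60))  # ['a.parquet', 'b.parquet']
```

The point of the sketch: planning cost scales with metadata, not with the number of files in object storage, which is why no S3 listing is ever needed.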
The Swamp Before Iceberg: Why Hive-Partitioned Tables Failed
To understand why Iceberg matters, you need to feel the pain it replaces. Hive-partitioned tables on S3 were the de facto standard for years — and they were brutal to maintain at scale.
- Partition columns leak into schema, breaking downstream queries
- S3 file listing for metadata = O(n files) — catastrophically slow at scale
- No ACID transactions: concurrent writes corrupt tables silently
- Schema changes (column renames, type changes) are destructive operations
- No time-travel: yesterday's data is gone once overwritten
- Modifying a partition requires rewriting the entire partition
- Engine-specific: same table can't be safely shared between Spark, Presto, Hive
- DELETE and UPDATE operations are either impossible or require full rewrites

Iceberg was designed to answer each of these failures directly:
- Hidden partitioning: users query on real columns, transforms applied automatically
- Metadata stored in manifest files — no S3 listing ever required
- Full ACID snapshot isolation: reads never blocked by concurrent writes
- Schema evolution: add, rename, drop, reorder columns safely via metadata
- Time travel: query any historical snapshot with AS OF TIMESTAMP
- Partition evolution: change partition strategy as a pure metadata operation
- Open spec: any engine reads the same table identically, simultaneously
- Row-level deletes and updates via position and equality delete files
What Iceberg Does That No Other Format Can
All three major open table formats — Iceberg, Delta Lake, and Apache Hudi — solve ACID transactions, schema evolution, and time travel. The real differentiation lies in architectural choices that only become visible when your tables grow to petabyte scale, your team changes over time, or you need to mix query engines.
With hidden partitioning, users query on event_time, not event_time_day. Iceberg applies partition transforms (bucket, truncate, year/month/day/hour) transparently. No more broken queries from stale partition column knowledge. No more "why is my query scanning everything?" support tickets.
Iceberg vs Delta Lake vs Hudi: The Honest Comparison
The table format debate of 2025 has matured beyond tribalism. All three formats are production-grade. The right choice depends on your workload, your engine ecosystem, and your team's operational preferences. Here is an honest feature comparison grounded in the latest research.
| Feature / Dimension | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Open specification | Full open spec | Coupled to Databricks catalog | Partial |
| Partition evolution | Metadata only — unique | Liquid Clustering alternative | Not supported |
| Multi-engine support | Broadest ecosystem | Strong Spark / Databricks | Spark + Flink primary |
| Streaming / CDC workloads | Good | Good | Best — designed for this |
| Metadata scalability | Excellent (manifest tree) | Good (log replay) | Good |
| Hidden partitioning | Native | Not available | Not available |
| Time travel / snapshots | Full | Full | Full |
| ACID transactions | Full | Full | Full |
| Schema evolution | Comprehensive | Comprehensive | Good |
| Vendor lock-in risk | Minimal — open catalog | Moderate — Unity Catalog | Low |
| Fortune 500 installed base | Growing fast — #1 new adoption | Largest — 60%+ via Databricks | Niche — streaming specialists |
| Catalog options | Hive · Glue · Polaris · Nessie · REST | Unity · Hive · Glue | Hive · custom |
The clearest signal of Iceberg's momentum: Databricks itself — Delta Lake's creator — acquired Tabular (the company founded by Iceberg's original Netflix creators) for over $1 billion in mid-2024, and now offers both Delta and managed Iceberg tables. Even Delta's home platform hedged its bets on the format that has become the interoperability lingua franca.
Snowflake + Apache Iceberg: The Integration That Changes Everything
Of all the Iceberg integrations in the ecosystem, Snowflake's is arguably the most consequential. Snowflake made Iceberg tables generally available in June 2024, and then went further — open-sourcing the Polaris Catalog (now Apache Polaris), effectively donating a vendor-neutral Iceberg REST catalog to the entire industry. This was not a small gesture.
How the Snowflake–Iceberg Integration Works
Snowflake offers two modes for working with Iceberg tables, catering to different architectural philosophies:
1. Snowflake-Managed Iceberg Tables: Snowflake writes and manages the Iceberg metadata. Data files are stored in your external cloud storage (S3, GCS, Azure). You get Snowflake's full query performance, governance (column masking, row access policies, object tagging via Horizon), and data sharing — all applied natively to Iceberg tables as if they were native Snowflake objects.
2. External Iceberg Tables (Read from External Catalog): Point Snowflake at an Iceberg table managed by an external catalog (AWS Glue, Apache Polaris, or Hive). Snowflake queries the table without moving data. Ideal for organizations where Spark or Flink is the primary write engine and Snowflake is the analytics layer. No ETL required — just a catalog integration object and a metadata refresh.
3. Snowpipe Streaming Directly to Iceberg: Snowflake's Snowpipe Streaming SDK now writes data directly to Iceberg table format in real time. Organizations can stream Kafka topics, IoT events, or CDC streams into Iceberg tables at sub-minute latency — then query them immediately in Snowflake or any other Iceberg-compatible engine.
4. Open Catalog (Managed Apache Polaris): Snowflake hosts Apache Polaris as a managed service — the Snowflake Open Catalog. It implements the Iceberg REST Catalog API, providing centralized governance for Iceberg tables across any compatible engine. Role-based access, principal roles, catalog-level storage scoping, and validation are all built in.
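The Iceberg REST Catalog API that Polaris implements is what lets any engine find and load the same tables. A sketch of its endpoint layout — the host and prefix below are made-up placeholders, and only the request URLs are built, no network call is made:

```python
# Sketch of the Iceberg REST Catalog endpoint layout, as implemented by
# Apache Polaris / Snowflake Open Catalog. Hypothetical host and prefix.

BASE = "https://polaris.example.com/api/catalog"  # placeholder host

def config_url():
    # Engines call this first to discover catalog capabilities.
    return f"{BASE}/v1/config"

def table_url(prefix, namespace_levels, table):
    # Multi-level namespaces are joined with the %1F unit separator
    # per the Iceberg REST spec; loading this URL returns the table's
    # current metadata, from which the engine plans its scan.
    ns = "%1F".join(namespace_levels)
    return f"{BASE}/v1/{prefix}/namespaces/{ns}/tables/{table}"

print(table_url("my_catalog", ["analytics", "prod"], "events"))
# https://polaris.example.com/api/catalog/v1/my_catalog/namespaces/analytics%1Fprod/tables/events
```

Because every engine resolves tables through the same small HTTP surface, the catalog — not any one vendor's runtime — becomes the point of coordination.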
The Momentum Is Undeniable
By every measure available in 2025, Apache Iceberg has achieved escape velocity. The pattern across the industry is remarkably consistent: Iceberg has become the interoperability lingua franca. Even platforms that don't use Iceberg natively now expose their data as Iceberg for cross-engine access.
When Should You Choose Iceberg?
Despite its broad advantages, Iceberg is not the answer to every question. Here is an honest assessment of where each format excels:
Choose Apache Iceberg when:
- You need multi-engine interoperability (Spark writes, Snowflake queries, Flink streams)
- Avoiding vendor lock-in is a top priority
- Your tables will grow to petabyte scale with frequent schema or partition changes
- Your cloud bill is driven by data scanned (you need maximum file pruning)
- You're building on Snowflake, Trino, Dremio, or BigQuery as primary query engines

Choose Delta Lake when:
- Your stack is deeply Databricks-centric and you value tight Spark integration
- You already have a large Delta installed base that works well
- Batch processing performance on TPC-DS benchmarks is a primary concern

Choose Apache Hudi when:
- Your primary workload is high-frequency CDC (Change Data Capture) from databases
- You need record-level upserts at massive scale (the Uber, Robinhood, Walmart use case)
- Merge-on-Read with async background compaction is your preferred write pattern
Iceberg Isn't a Table Format. It's a New Data Contract.
Apache Iceberg's success cannot be explained by a single killer feature. It is the accumulation of correct architectural decisions made years ago — an open specification instead of an implementation, hidden partitioning instead of user-visible partition columns, a manifest tree instead of a flat transaction log — that compound into a system uniquely suited to the fragmented, multi-engine, cloud-native reality of modern data infrastructure.
The $1 billion Tabular acquisition, AWS S3 Tables, Snowflake's full commitment, Confluent Tableflow, and Apache Polaris are not independent events. They are all acknowledgments of the same conclusion the industry has quietly reached: when data needs to flow freely between engines without copying, without lock-in, and without silent corruption, Apache Iceberg is the answer.
For data engineers building new lakehouses today, Iceberg is not a bet on a technology. It is a bet on the principle that open standards — properly designed, with governance given to the community — outcompete proprietary formats in the long run. History, in data infrastructure, has consistently favored the open standard. Apache Iceberg looks set to be the clearest example of this truth in a generation.
The Iceberg Is Just the Tip
Iceberg v3, geospatial support, streaming-native writes, and Apache Polaris as a production-grade catalog: the format that started as Netflix's internal tool is now the foundation of the next decade of data infrastructure.
Sources & Further Reading
- Onehouse. (Oct 2025). Apache Hudi vs Delta Lake vs Apache Iceberg: Lakehouse Feature Comparison. onehouse.ai
- Dremio. (2025). Comparison of Data Lake Table Formats: Apache Iceberg, Apache Hudi and Delta Lake. dremio.com
- Körükcü, Y.A. (Feb 2026). Apache Iceberg vs Delta Lake vs Hudi: The Real Differences Nobody Explains Simply. Medium.
- Xenoss. (Aug 2025). Apache Iceberg vs Delta Lake vs Hudi Comparison. xenoss.io
- Snowflake Engineering Blog. (2025). Apache Polaris: The End of Data Vendor Lock-In. snowflake.com
- TechTarget. (Apr 2025). Snowflake broadens open-source embrace, ups Iceberg support. techtarget.com
- Atlan. (Mar 2025). Apache Iceberg in Snowflake: A Practical Guide. atlan.com
- VentureBeat. (Jun 2024). Snowflake unveils Polaris, a vendor-neutral open catalog for Apache Iceberg. venturebeat.com
- LakeFS. (Mar 2025). Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared. lakefs.io
- Reintech. (Apr 2026). Apache Iceberg vs Delta Lake vs Apache Hudi 2026: Table Format Comparison. reintech.io