
Data Engineering · Open Table Formats · 2025

Apache Iceberg: The Open Table Format Quietly Winning the Data Wars

Born at Netflix to solve petabyte-scale chaos, Apache Iceberg has become the industry's de facto standard for the modern data lakehouse — and for good reason.

By Arabinda Mohpatra · Published May 2025 · Read time ~18 min
By the numbers:
  • $1B+: Databricks' 2024 acquisition of Tabular, the company founded by Iceberg's creators
  • 100%: Snowflake's commitment to Apache Iceberg as its sole open table format
  • 7+: major cloud and engine providers natively supporting Iceberg
  • #1: most planned-adoption format per Dremio's 2024 survey

Netflix Had a Problem. A Petabyte-Scale Problem.

It was 2017, and Netflix's data engineers were fighting a war on two fronts: the relentless growth of their streaming data and the painful limitations of Apache Hive, the then-dominant table format. Querying petabytes of Parquet files on Amazon S3 was like trying to find a specific grain of sand on a beach — without a map. File listings were slow, schema changes were dangerous, and concurrent writes were a gamble.

Ryan Blue and Daniel Weeks, engineers at Netflix, decided to build something different. They designed Apache Iceberg from first principles — not as a patch on Hive, but as a complete reimagining of what a table format should be in a cloud-native, distributed world. The result, open-sourced and donated to the Apache Software Foundation in 2018 (it graduated to a Top-Level Project in 2020), has since grown into the most consequential architectural choice in modern data engineering.

"We're 100% committed to Apache Iceberg. Customers want thriving open ecosystems, and they don't want to be locked in."

— Christian Kleinerman, EVP of Product, Snowflake · Data Cloud Summit 2024

The timing of Iceberg's rise is no accident. The data lakehouse architecture — the idea of combining the cheap, scalable storage of a data lake with the transactional and analytical power of a data warehouse — demands exactly what Iceberg provides: a reliable, engine-agnostic, open metadata layer sitting on top of Parquet files in object storage.

What Is Apache Iceberg, Really?

At its core, Apache Iceberg is an open table format specification. Not a storage engine. Not a query engine. A specification that precisely defines how metadata, manifest files, data files, and catalogs interact — such that any engine implementing the spec can read and write Iceberg tables correctly, without depending on a specific runtime library.

This distinction matters enormously. The open spec is why Spark, Flink, Trino, Snowflake, DuckDB, BigQuery, and dozens of other tools can all interoperate on the exact same Iceberg table without conflict, coordination, or copying data.

◈ Apache Iceberg — Three-Layer Architecture (How a Query Resolves)

① Catalog layer (Hive Metastore · AWS Glue · Apache Polaris (REST) · Nessie · JDBC). Maps a table name such as my_catalog.analytics.user_events to the current metadata pointer, e.g. s3://datalake/user_events/metadata/v5.metadata.json. The pointer is updated atomically on every commit and resolves to the current metadata file.

② Metadata layer. The metadata.json file records the format version, the current snapshot ID, the full schema history, the partition evolution log, and time-travel anchors, and points to a manifest list with one entry per manifest file (added/deleted file counts, partition ranges, snapshot-level summary stats). Each manifest file holds one entry per data file: its path, partition, record count, file size, and per-column statistics (lower bound, upper bound, null count). Three-level pruning (partition filters skip whole manifests, manifest entries skip file groups, column statistics skip individual data files) means zero files are opened before pruning ends.

③ Data layer. Plain Parquet files in cloud object storage (S3 · GCS · ADLS · MinIO), plus equality/position delete files that enable row-level UPDATE, DELETE, and MERGE in the v2 format. No proprietary format.

Query engines: Spark · Flink · Trino · Snowflake · BigQuery · DuckDB · Athena · Dremio.

The three-layer architecture is the key to everything Iceberg does. The catalog layer stores a pointer to the current metadata file. The metadata layer contains snapshots, partition specs, and schema history, plus manifest files that track every data file together with its per-column statistics (min, max, null counts). The data layer is just Parquet files in your object storage — no proprietary format, no lock-in.

When a query runs, Iceberg performs three-level pruning: first eliminating irrelevant partitions, then pruning manifest files, then pruning individual data files using column statistics — all before opening a single file. This is how Iceberg delivers sub-second query planning on tables with billions of rows.
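You can see each of these layers directly: Iceberg exposes its internals as queryable metadata tables. A minimal sketch in Spark SQL, assuming an Iceberg catalog named demo and a hypothetical table db.events:

-- Snapshots: the time-travel anchors recorded in metadata.json
SELECT snapshot_id, committed_at, operation
FROM demo.db.events.snapshots;

-- Manifests: one row per manifest file in the current snapshot
SELECT path, added_data_files_count, existing_data_files_count
FROM demo.db.events.manifests;

-- Data files: per-file counts and sizes from the manifest entries
SELECT file_path, record_count, file_size_in_bytes
FROM demo.db.events.files;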

The Swamp Before Iceberg: Why Hive-Partitioned Tables Failed

To understand why Iceberg matters, you need to feel the pain it replaces. Hive-partitioned tables on S3 were the de facto standard for years — and they were brutal to maintain at scale.

Before Iceberg
  • Partition columns leak into schema, breaking downstream queries
  • S3 file listing for metadata = O(n files) — catastrophically slow at scale
  • No ACID transactions: concurrent writes corrupt tables silently
  • Schema changes (column renames, type changes) are destructive operations
  • No time-travel: yesterday's data is gone once overwritten
  • Modifying a partition requires rewriting the entire partition
  • Engine-specific: same table can't be safely shared between Spark, Presto, Hive
  • DELETE and UPDATE operations are either impossible or require full rewrites
Apache Iceberg
  • Hidden partitioning: users query on real columns, transforms applied automatically
  • Metadata stored in manifest files — no S3 listing ever required
  • Full ACID snapshot isolation: reads never blocked by concurrent writes
  • Schema evolution: add, rename, drop, reorder columns safely via metadata
  • Time travel: query any historical snapshot with TIMESTAMP AS OF (see the sketch after this list)
  • Partition evolution: change partition strategy as a pure metadata operation
  • Open spec: any engine reads the same table identically, simultaneously
  • Row-level deletes and updates via position and equality delete files
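To make the right-hand column concrete, here is a minimal Spark SQL sketch of schema evolution and time travel, reusing the hypothetical demo.db.events table from above:

-- Schema evolution is metadata-only: no data files are rewritten
ALTER TABLE demo.db.events ADD COLUMN device_type STRING;
ALTER TABLE demo.db.events RENAME COLUMN payload TO event_payload;

-- Time travel: read the table as of a timestamp or a snapshot ID
SELECT count(*) FROM demo.db.events TIMESTAMP AS OF '2025-03-01 00:00:00';
SELECT count(*) FROM demo.db.events VERSION AS OF 7392840193;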

What Iceberg Does That No Other Format Can

All three major open table formats — Iceberg, Delta Lake, and Apache Hudi — solve ACID transactions, schema evolution, and time travel. The real differentiation lies in architectural choices that only become visible when your tables grow to petabyte scale, your team changes over time, or you need to mix query engines.

Partition Evolution — Iceberg Only
Change your partitioning strategy (e.g., daily → hourly) as a pure metadata operation. Old files retain their original partition layout; new files use the new layout. Queries span both seamlessly. Delta Lake and Hudi have no equivalent — this is the feature most often cited by data engineers when justifying Iceberg.
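With the Iceberg Spark SQL extensions enabled, switching from daily to hourly partitioning is a single DDL statement. A sketch, assuming a timestamp column ts whose existing daily partition field is named ts_day:

-- Existing files keep their daily layout; new writes land in hourly partitions
ALTER TABLE demo.db.events REPLACE PARTITION FIELD ts_day WITH hours(ts);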
Hidden Partitioning
Users query on event_time, not event_time_day. Iceberg applies partition transforms (bucket, truncate, year/month/day/hour) transparently. No more broken queries from stale partition column knowledge. No more "why is my query scanning everything?" support tickets.
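A sketch of hidden partitioning in Spark SQL (table and column names are hypothetical): the table is partitioned by transforms of event_time and user_id, yet queries filter on the real columns and Iceberg prunes automatically.

CREATE TABLE demo.db.clicks (
  user_id    BIGINT,
  event_time TIMESTAMP,
  url        STRING
) USING iceberg
PARTITIONED BY (days(event_time), bucket(16, user_id));

-- No partition column in the query: the event_time filter alone drives pruning
SELECT count(*) FROM demo.db.clicks
WHERE event_time >= TIMESTAMP '2025-01-01 00:00:00';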
Open Specification Model
Unlike Delta Lake, which is tightly coupled to Databricks' proprietary catalog implementation, Iceberg is defined by a published open specification. Any engine can independently implement it. This is why Snowflake, AWS, Google, Microsoft, and Confluent have all adopted Iceberg — there's no single vendor controlling the roadmap.
Scalable Metadata Architecture
Iceberg's manifest file tree scales to billions of files without performance degradation. Delta Lake's transaction log approach requires replaying JSON commits from the last checkpoint, which can become expensive for tables with millions of files. Iceberg's pruning hierarchy (partition → manifest → file) is O(log n), not O(n).
Snapshot Isolation & Multi-Engine Writes
Iceberg's atomic metadata swap model allows multiple engines to write to the same table concurrently with full snapshot isolation. A Spark job and a Flink streaming job can both write to the same Iceberg table; readers always see a consistent snapshot and are never blocked.
Column-Level Statistics in Manifests
Every manifest file stores min, max, and null counts for every column in every data file it tracks. Query engines exploit this for aggressive file pruning without opening files. The result: dramatically reduced I/O costs on selective queries — a game-changer for cloud-native pricing models where you pay per byte scanned.
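These statistics are themselves queryable. A sketch against the files metadata table of the hypothetical demo.db.events, where bounds are stored per column as serialized min/max values:

SELECT file_path,
       record_count,
       null_value_counts,
       lower_bounds,
       upper_bounds
FROM demo.db.events.files;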

Iceberg vs Delta Lake vs Hudi: The Honest Comparison

The table format debate of 2025 has matured beyond tribalism. All three formats are production-grade. The right choice depends on your workload, your engine ecosystem, and your team's operational preferences. Here is an honest feature comparison, grounded in the vendor documentation and independent comparisons listed under Sources & Further Reading.

Feature / Dimension        | Apache Iceberg                        | Delta Lake                    | Apache Hudi
---------------------------|---------------------------------------|-------------------------------|------------------------------
Open specification         | Full open spec                        | Coupled to Databricks catalog | Partial
Partition evolution        | Metadata only (unique)                | Liquid Clustering alternative | Not supported
Multi-engine support       | Broadest ecosystem                    | Strong Spark / Databricks     | Spark + Flink primary
Streaming / CDC workloads  | Good                                  | Good                          | Best (designed for this)
Metadata scalability       | Excellent (manifest tree)             | Good (log replay)             | Good
Hidden partitioning        | Native                                | Not available                 | Not available
Time travel / snapshots    | Full                                  | Full                          | Full
ACID transactions          | Full                                  | Full                          | Full
Schema evolution           | Comprehensive                         | Comprehensive                 | Good
Vendor lock-in risk        | Minimal (open catalog)                | Moderate (Unity Catalog)      | Low
Fortune 500 installed base | Growing fast (#1 new adoption)        | Largest (60%+ via Databricks) | Niche (streaming specialists)
Catalog options            | Hive · Glue · Polaris · Nessie · REST | Unity · Hive · Glue           | Hive · custom

The clearest signal of Iceberg's momentum: Databricks itself — Delta Lake's creator — acquired Tabular (the company founded by Iceberg's original Netflix creators) for over $1 billion in mid-2024, and now offers both Delta and managed Iceberg tables. Even Delta's home platform hedged its bets on the format that has become the interoperability lingua franca.

Snowflake + Apache Iceberg: The Integration That Changes Everything

Of all the Iceberg integrations in the ecosystem, Snowflake's is arguably the most consequential. Snowflake made Iceberg tables generally available in June 2024, and then went further — open-sourcing the Polaris Catalog (now Apache Polaris), effectively donating a vendor-neutral Iceberg REST catalog to the entire industry. This was not a small gesture.

Apache Polaris: Snowflake's Open Bet on Interoperability

In June 2024, Snowflake announced Polaris Catalog — an open-source, vendor-neutral Iceberg catalog implementation built on the Iceberg REST API standard. It was donated to the Apache Software Foundation, and in 2025 it graduated as an Apache Top-Level Project with contributions from hundreds of engineers across Google, Microsoft, AWS, Confluent, and Dremio.

The strategic implication is profound: instead of coupling catalog and format (as Databricks does with Unity Catalog and Delta), Snowflake chose to standardize on an open catalog. This means any engine can interoperate with Snowflake-managed Iceberg data — Spark, Flink, Trino, DuckDB, and Dremio can all read and write the same tables managed by Polaris without data movement.
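Because Polaris implements the standard Iceberg REST Catalog API, pointing an engine at it is a matter of configuration. A minimal Spark SQL sketch (the catalog name, URI, and credential are hypothetical placeholders):

-- Launch spark-sql with an Iceberg REST catalog, for example:
--   --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
--   --conf spark.sql.catalog.polaris.type=rest
--   --conf spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog
--   --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret>

-- Then Polaris-managed tables behave like any other catalog:
SELECT * FROM polaris.analytics.user_events LIMIT 10;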

How the Snowflake–Iceberg Integration Works

Snowflake offers two modes for working with Iceberg tables, catering to different architectural philosophies:

  • 01
    Snowflake-Managed Iceberg Tables
    Snowflake writes and manages the Iceberg metadata. Data files are stored in your external cloud storage (S3, GCS, Azure). You get Snowflake's full query performance, governance (column masking, row access policies, object tagging via Horizon), and data sharing — all applied natively to Iceberg tables as if they were native Snowflake objects.
  • 02
    External Iceberg Tables (Read from External Catalog)
    Point Snowflake at an Iceberg table managed by an external catalog (AWS Glue, Apache Polaris, or Hive). Snowflake queries the table without moving data. Ideal for organizations where Spark or Flink is the primary write engine and Snowflake is the analytics layer. No ETL required — just a catalog integration object and a metadata refresh (a configuration sketch follows the example below).
  • 03
    Snowpipe Streaming Directly to Iceberg
    Snowflake's Snowpipe Streaming SDK now writes data directly to Iceberg table format in real-time. Organizations can stream Kafka topics, IoT events, or CDC streams into Iceberg tables at sub-minute latency — then query them immediately in Snowflake or any other Iceberg-compatible engine.
  • 04
    Open Catalog (Managed Apache Polaris)
    Snowflake hosts Apache Polaris as a managed service — the Snowflake Open Catalog. It implements the Iceberg REST Catalog API, providing centralized governance for Iceberg tables across any compatible engine. Role-based access, principal roles, catalog-level storage scoping, and validation are all built in.
-- Create an external volume pointing to your S3 bucket
CREATE EXTERNAL VOLUME iceberg_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-us-east'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-data-lake/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/sf-iceberg-role'
    )
  );

-- Create a native Iceberg table — stored in S3, managed by Snowflake
CREATE ICEBERG TABLE events (
  event_id   VARCHAR,
  event_time TIMESTAMP_NTZ,
  user_id    VARCHAR,
  payload    VARIANT
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'iceberg_vol'
  BASE_LOCATION = 'events/';

-- Time travel works natively — as with any Snowflake table
SELECT * FROM events
  AT (TIMESTAMP => '2025-03-01 00:00:00'::TIMESTAMP_NTZ);
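The second mode works in the opposite direction: another engine writes the table, and Snowflake reads it through a catalog integration. A sketch using AWS Glue (the role ARN, catalog ID, and object names are hypothetical placeholders):

-- Register the external catalog once
CREATE CATALOG INTEGRATION glue_cat
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'analytics'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/sf-glue-role'
  GLUE_CATALOG_ID = '123456789'
  ENABLED = TRUE;

-- Expose the externally managed table; no data movement
CREATE ICEBERG TABLE user_events
  CATALOG = 'glue_cat'
  EXTERNAL_VOLUME = 'iceberg_vol'
  CATALOG_TABLE_NAME = 'user_events';

-- Pick up new snapshots written by Spark or Flink
ALTER ICEBERG TABLE user_events REFRESH;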

Advantages of the Snowflake + Iceberg Stack

Zero Vendor Lock-In on Storage
Your data lives in your S3/GCS/ADLS bucket in open Parquet files. If you ever leave Snowflake, your data comes with you — readable by any Iceberg-compatible engine. This was simply not true with proprietary Snowflake internal storage.
Snowflake Governance on Open Data
Column masking, row access policies, object tags, and data sharing from Snowflake Horizon apply to Iceberg tables as if they were native. You get enterprise-grade security without migrating data into a proprietary format.
Multi-Engine Access from One Catalog
Via Polaris/Open Catalog, your Spark pipelines, Flink streaming jobs, and Snowflake analytics can all point at the same catalog, see the same tables, and respect the same governance rules — without copying data between systems.
Iceberg v3 Support (2025)
Snowflake has committed to supporting the Iceberg v3 spec, which adds row-level change data capture, geospatial data types, nanosecond-precision timestamps, and deletion vectors — critical capabilities for high-frequency trading, IoT, and operational analytics.

The Momentum Is Undeniable

By every measure available in 2025, Apache Iceberg has achieved escape velocity. The pattern across the industry is remarkably consistent: Iceberg has become the interoperability lingua franca. Even platforms that don't use Iceberg natively now expose their data as Iceberg for cross-engine access.

2017–2018
Netflix engineers Ryan Blue and Daniel Weeks build and open-source the first version of Iceberg to solve petabyte-scale Parquet management on S3.
2019–2020
Iceberg incubates at the Apache Software Foundation and graduates to a top-level Apache project in May 2020. Adoption begins spreading beyond Netflix.
2022
Snowflake announces Iceberg table preview at its Data Cloud Summit. AWS and Google begin adding native Iceberg support to Glue and BigQuery.
June 2024
Snowflake makes Iceberg tables GA and open-sources the Polaris Catalog, donating it to the Apache Software Foundation. Confluent's Tableflow converts Kafka topics directly to Iceberg; by December, AWS launches S3 Tables with built-in Iceberg support.
Mid-2024
Databricks acquires Tabular (the company founded by Iceberg's original Netflix creators) for over $1 billion — the single clearest signal of Iceberg's strategic importance.
2025
Apache Polaris graduates to a Top-Level Apache Project. Iceberg v3 released with streaming, geospatial, and deletion vector support. Dremio's survey shows Iceberg on track to surpass Delta as the most-used format within three years.

Who Is Using Iceberg at Scale?

Netflix
Petabyte-scale streaming analytics. The project's birthplace — hundreds of production Iceberg tables backing recommendation and content systems.
Apple
Major contributor to the Iceberg project. Uses Iceberg for exabyte-scale internal analytics infrastructure.
LinkedIn
Large-scale data lakehouse migration to Iceberg for member analytics and ML feature stores.
Adobe
Creative Cloud analytics on Iceberg tables, leveraging multi-engine access across Spark and Trino.
Airbnb
Migrated from Hive-partitioned tables to Iceberg for their core pricing and availability data platform.
AWS (S3 Tables)
Native Iceberg tables built directly into S3 — the ultimate sign of cloud-platform endorsement.

When Should You Choose Iceberg?

Despite its broad advantages, Iceberg is not the answer to every question. Here is an honest assessment of where each format excels:

Choose Iceberg When...
  • You need multi-engine interoperability (Spark writes, Snowflake queries, Flink streams)
  • Avoiding vendor lock-in is a top priority
  • Your tables will grow to petabyte scale with frequent schema or partition changes
  • Your cloud bill is driven by data scanned (you need maximum file pruning)
  • You're building on Snowflake, Trino, Dremio, or BigQuery as primary query engines
Consider Delta Lake When...
  • Your stack is deeply Databricks-centric and you value tight Spark integration
  • You already have a large Delta installed base that works well
  • Batch processing performance on TPC-DS benchmarks is a primary concern
Consider Hudi When...
  • Your primary workload is high-frequency CDC (Change Data Capture) from databases
  • You need record-level upserts at massive scale (Uber, Robinhood, Walmart use case)
  • Merge-on-Read with async background compaction is your preferred write pattern

Iceberg Isn't a Table Format. It's a New Data Contract.

Apache Iceberg's success cannot be explained by a single killer feature. It is the accumulation of correct architectural decisions made years ago — an open specification instead of an implementation, hidden partitioning instead of user-visible partition columns, a manifest tree instead of a flat transaction log — that compound into a system uniquely suited to the fragmented, multi-engine, cloud-native reality of modern data infrastructure.

The $1 billion Tabular acquisition, AWS S3 Tables, Snowflake's full commitment, Confluent Tableflow, and Apache Polaris are not independent events. They are all acknowledgments of the same conclusion the industry has quietly reached: when data needs to flow freely between engines without copying, without lock-in, and without silent corruption, Apache Iceberg is the answer.

For data engineers building new lakehouses today, Iceberg is not a bet on a technology. It is a bet on the principle that open standards — properly designed, with governance given to the community — outcompete proprietary formats in the long run. History, in data infrastructure, has consistently favored the open standard. Apache Iceberg looks set to be the clearest example of this truth in a generation.

The Iceberg Is Just the Tip

Iceberg v3, geospatial support, streaming-native writes, and Apache Polaris as a production-grade catalog: the format that started as Netflix's internal tool is now the foundation of the next decade of data infrastructure.

Sources & Further Reading

  1. Onehouse. (Oct 2025). Apache Hudi vs Delta Lake vs Apache Iceberg: Lakehouse Feature Comparison. onehouse.ai
  2. Dremio. (2025). Comparison of Data Lake Table Formats: Apache Iceberg, Apache Hudi and Delta Lake. dremio.com
  3. Körükcü, Y.A. (Feb 2026). Apache Iceberg vs Delta Lake vs Hudi: The Real Differences Nobody Explains Simply. Medium.
  4. Xenoss. (Aug 2025). Apache Iceberg vs Delta Lake vs Hudi Comparison. xenoss.io
  5. Snowflake Engineering Blog. (2025). Apache Polaris: The End of Data Vendor Lock-In. snowflake.com
  6. TechTarget. (Apr 2025). Snowflake broadens open-source embrace, ups Iceberg support. techtarget.com
  7. Atlan. (Mar 2025). Apache Iceberg in Snowflake: A Practical Guide. atlan.com
  8. VentureBeat. (Jun 2024). Snowflake unveils Polaris, a vendor-neutral open catalog for Apache Iceberg. venturebeat.com
  9. LakeFS. (Mar 2025). Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared. lakefs.io
  10. Reintech. (Apr 2026). Apache Iceberg vs Delta Lake vs Apache Hudi 2026: Table Format Comparison. reintech.io
