Iceberg Architecture: A Three-Tier Structure

Apache Iceberg is an open table format designed for large-scale analytic datasets stored in data lakes. It tracks a table as a tree of metadata that sits above the data files themselves, organized in three tiers: metadata files, manifest list files, and manifest files. Here’s a detailed explanation of each tier and how they work together:

1. Metadata Files

Metadata files are at the top level of Iceberg's architecture. They store information about the table’s schema, partitioning, and snapshot references.

Table Metadata: This includes the table schema, partitioning information, and properties. It keeps track of the current snapshot ID and a list of all snapshots.
Snapshots: Each snapshot captures the complete state of the table at a point in time. A snapshot references one manifest list file and carries a summary of the operation that produced it (for example append, overwrite, or delete), including counts of the data files added and removed.
Versioned: Metadata files are never edited in place. Every change to the table's data, schema, or properties is committed by writing a new metadata file, and the catalog points to the current one. Earlier versions and their snapshots remain available, allowing easy rollbacks and historical queries.
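
As a rough illustration of the fields described above, the sketch below parses a heavily trimmed metadata document and resolves the current snapshot to its manifest list. The key names (format-version, location, current-snapshot-id, snapshots, snapshot-id, timestamp-ms, manifest-list) follow the Iceberg table spec; the bucket, paths, IDs, and the helper current_manifest_list are invented for this example, and a real metadata file contains many more fields.

```python
import json

# A heavily trimmed metadata document. Key names follow the Iceberg table
# spec; every value (paths, IDs, timestamps) is invented for illustration.
METADATA_JSON = """
{
  "format-version": 2,
  "location": "s3://example-bucket/warehouse/db/events",
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1700000000000,
     "manifest-list": "s3://example-bucket/warehouse/db/events/metadata/snap-1.avro"},
    {"snapshot-id": 2, "timestamp-ms": 1700000600000,
     "manifest-list": "s3://example-bucket/warehouse/db/events/metadata/snap-2.avro"}
  ]
}
"""


def current_manifest_list(metadata: dict) -> str:
    """Return the manifest list path of the table's current snapshot."""
    current_id = metadata["current-snapshot-id"]
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise LookupError(f"snapshot {current_id} not found in metadata")


metadata = json.loads(METADATA_JSON)
print(current_manifest_list(metadata))
```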

2. Manifest List Files

Manifest list files act as an intermediary layer between metadata files and manifest files. Each manifest list file points to a set of manifest files.

Snapshot Reference: Each snapshot in the metadata file references a manifest list file. The manifest list file lists all the manifest files included in that snapshot.
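Efficient Reads: Each entry in a manifest list records summary information about one manifest file, such as the range of partition values it covers and how many files it tracks. Query planning can use these summaries to skip entire manifest files without opening them, which makes scanning large tables more efficient.
Data Operations: Manifest list files are immutable. Every new snapshot gets its own manifest list, which can reuse unchanged manifest files from earlier snapshots. When a snapshot is expired, its manifest list is no longer needed and can be cleaned up.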
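
One way to picture how a manifest list speeds up planning: each entry carries lower and upper bounds for the partition values inside its manifest, so a planner can discard whole manifests before reading them. The sketch below is a simplified model, not Iceberg's reader. The class ManifestSummary and function manifests_to_scan are hypothetical, and it assumes a single date partition field stored as an ISO string (real entries hold one summary per partition field, in binary form).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestSummary:
    """One manifest list entry, reduced to a single partition field."""
    manifest_path: str
    lower_bound: str  # smallest partition value found in the manifest
    upper_bound: str  # largest partition value found in the manifest


def manifests_to_scan(entries, wanted):
    """Keep only manifests whose partition range could contain `wanted`."""
    return [e.manifest_path for e in entries
            if e.lower_bound <= wanted <= e.upper_bound]


# Invented entries: three manifests, each covering one month of data.
manifest_list = [
    ManifestSummary("m-001.avro", "2024-01-01", "2024-01-31"),
    ManifestSummary("m-002.avro", "2024-02-01", "2024-02-29"),
    ManifestSummary("m-003.avro", "2024-03-01", "2024-03-31"),
]

# ISO dates compare correctly as strings, so plain comparison is enough here.
print(manifests_to_scan(manifest_list, "2024-02-14"))
```

With a filter on a single day, only one of the three manifests needs to be opened; the other two are skipped based on the manifest list alone.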

3. Manifest Files

Manifest files are the lowest tier in Iceberg's architecture. They contain detailed information about the actual data files stored in the table.

File Metadata: Each manifest file contains metadata about a group of data files, including their paths, partition information, and metrics (e.g., row counts, column statistics).
Partition Pruning: Manifest files record each data file's partition values and per-column bounds. This lets query planning quickly identify the relevant files for a query based on partition and column filters, without opening the data files themselves.
Granular Updates: Manifest files are immutable. When data files are added or removed, Iceberg writes new manifest files that record the change and reuses the manifest files that were not affected. This allows Iceberg to manage changes at a fine-grained level without rewriting large amounts of metadata or data.
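
The file-level bookkeeping described above can be sketched as follows. Each manifest entry has a status (the spec uses 0 for existing, 1 for added, 2 for deleted) plus the file's partition value and column bounds. The ManifestEntry class, the plan_files function, the user_id column, and all paths are assumptions made for this example.

```python
from dataclasses import dataclass

# Entry status codes as defined by the Iceberg spec.
EXISTING, ADDED, DELETED = 0, 1, 2


@dataclass(frozen=True)
class ManifestEntry:
    status: int
    file_path: str
    partition: str      # partition value of the data file
    record_count: int
    lower: int          # lower bound of a hypothetical `user_id` column
    upper: int          # upper bound of the same column


def plan_files(entries, partition, user_id):
    """Return live data files that may hold rows matching both filters."""
    live = (e for e in entries if e.status != DELETED)
    return [e.file_path for e in live
            if e.partition == partition and e.lower <= user_id <= e.upper]


# Invented manifest contents.
entries = [
    ManifestEntry(ADDED, "data/a.parquet", "2024-02-14", 1000, 1, 500),
    ManifestEntry(EXISTING, "data/b.parquet", "2024-02-14", 800, 501, 900),
    ManifestEntry(DELETED, "data/c.parquet", "2024-02-14", 900, 1, 900),
    ManifestEntry(ADDED, "data/d.parquet", "2024-02-15", 700, 1, 900),
]

print(plan_files(entries, "2024-02-14", 640))
```

The deleted file is ignored, the file from the other partition is pruned, and the column bounds rule out one more file, all without reading any data file.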

How They Work Together

Initial Write: When data is first written to an Iceberg table, data files are created and their metadata is recorded in manifest files.
Manifest List Creation: A manifest list file is created to group these manifest files, which in turn is referenced by a snapshot in the metadata file.
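Metadata Update: A new metadata file is written that includes the new snapshot, pointing to the new manifest list file. The commit completes when the catalog's pointer to the current metadata file is atomically swapped to the new file.
Subsequent Writes: As new data is added, new manifest files are created, and a new manifest list file is written that references them alongside the unchanged manifest files from the previous snapshot. Each commit creates a new snapshot and a new metadata file.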
Queries: When querying the table, Iceberg reads the metadata file to identify the latest snapshot. It then reads the manifest list file to find the relevant manifest files and finally reads the manifest files to locate the actual data files.
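
The write and read paths above can be modeled in a few lines. This is a toy in-memory sketch, not how any Iceberg library is implemented: store stands in for object storage, catalog for the catalog's pointer to the current metadata file, and commit_append and plan_scan are hypothetical names. It treats every file as write-once, as the Iceberg spec describes, so each commit writes a new manifest, a new manifest list, and a new metadata file before swapping the pointer.

```python
import itertools

store = {}                   # path -> file contents (stand-in for object storage)
catalog = {"events": None}   # table name -> path of the current metadata file
_ids = itertools.count(1)


def put(path, content):
    """Write a new file; existing files are never overwritten."""
    assert path not in store, "files are immutable"
    store[path] = content
    return path


def commit_append(table, data_files):
    """Append data files by writing a new manifest, manifest list, and metadata file."""
    n = next(_ids)
    base_path = catalog[table]
    base = store[base_path] if base_path else {"current": None, "snapshots": {}}
    old_manifests = ()
    if base["current"] is not None:
        old_manifests = store[base["snapshots"][base["current"]]]
    manifest = put(f"manifest-{n}", tuple(data_files))
    # The new manifest list reuses the previous snapshot's manifests unchanged.
    manifest_list = put(f"snap-{n}", old_manifests + (manifest,))
    new_meta = {"current": n, "snapshots": {**base["snapshots"], n: manifest_list}}
    new_path = put(f"v{n}.metadata.json", new_meta)
    # Stand-in for the catalog's atomic compare-and-swap of the pointer.
    if catalog[table] != base_path:
        raise RuntimeError("concurrent commit detected; retry")
    catalog[table] = new_path
    return n


def plan_scan(table, snapshot_id=None):
    """Walk metadata file -> manifest list -> manifests -> data files."""
    meta = store[catalog[table]]
    sid = meta["current"] if snapshot_id is None else snapshot_id
    files = []
    for manifest_path in store[meta["snapshots"][sid]]:
        files.extend(store[manifest_path])
    return files


commit_append("events", ["a.parquet"])
commit_append("events", ["b.parquet", "c.parquet"])
print(plan_scan("events"))      # latest snapshot
print(plan_scan("events", 1))   # first snapshot is still readable
```

Because nothing is overwritten, the first snapshot remains fully readable after the second commit, and the second manifest list simply points at the first manifest again instead of copying its contents.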

Benefits

Efficiency: The three-tier structure allows for efficient reads and writes by minimizing the amount of metadata read during query planning and execution.
Scalability: Iceberg is designed to handle petabyte-scale datasets, making it suitable for large data lakes.
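Consistency: Because data, manifest, and metadata files are immutable, a commit becomes visible only when the pointer to the current metadata file is swapped atomically. Readers always see a complete snapshot, never a partial write, which is how Iceberg supports ACID properties.
Time Travel: Each metadata file retains the list of earlier snapshots, so users can query the table as it existed at a previous snapshot or point in time, enabling time travel queries.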
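
To make the time travel idea concrete, here is a minimal sketch of picking the snapshot that was current at a given moment. It assumes a list of snapshot records with the spec's snapshot-id and timestamp-ms keys; the function snapshot_as_of and the sample values are invented, and the selection rule is simplified compared with what a real engine does.

```python
def snapshot_as_of(snapshots, timestamp_ms):
    """Return the most recent snapshot committed at or before `timestamp_ms`."""
    candidates = [s for s in snapshots if s["timestamp-ms"] <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot at or before the requested time")
    return max(candidates, key=lambda s: s["timestamp-ms"])


# Invented snapshot history.
snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 1000},
    {"snapshot-id": 2, "timestamp-ms": 2000},
    {"snapshot-id": 3, "timestamp-ms": 3000},
]

print(snapshot_as_of(snapshots, 2500)["snapshot-id"])
```

Once the snapshot is chosen, planning proceeds exactly as for the latest snapshot, starting from that snapshot's manifest list.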

Conclusion

Apache Iceberg’s three-tier architecture of metadata files, manifest list files, and manifest files provides a robust and efficient framework for managing large-scale analytic datasets. This structure ensures efficient data operations, scalability, and data consistency, making Iceberg an excellent choice for modern data lake architectures.

#BigData #DataLakes #Iceberg #DataArchitecture #DataManagement #Analytics #Scalability #Efficiency #ACID
