Use liquid clustering for Delta tables

Source: https://www.linkedin.com/pulse/use-liquid-clustering-delta-tables-arabinda-mohapatra-ce86c?trackingId=ux5WBiGsSi2CFJi4kUVf3A%3D%3D

Arabinda Mohapatra

Running Kafka streams after dark, diving into genetic code by daylight, and wrestling with Databricks and Tableflow in every spare moment—sleep is optional

December 28, 2024

### 🌟 What is Liquid Clustering?

Liquid clustering improves on traditional partitioning and ZORDER techniques, providing the flexibility to redefine clustering columns without rewriting existing data. This allows your data layout to evolve alongside your analytic needs over time.

### 🔍 Use Cases for Liquid Clustering

- High Cardinality Columns: Ideal for tables often filtered by high cardinality columns.

- Skewed Data Distribution: Helps manage tables with significant skew in data distribution.

- Rapid Growth: Perfect for tables that grow quickly and require maintenance and tuning.

- Changing Access Patterns: Adapts to tables with changing access patterns over time.

- Optimal Partitioning: Avoids issues with too many or too few partitions in traditional partitioning.

### 🚀 Enabling Liquid Clustering

Enable liquid clustering when creating a table by adding the CLUSTER BY clause.

```sql
-- Create a table with liquid clustering
CREATE OR REPLACE TABLE dev.bronze.events (
    event_id INT,
    event_date DATE,
    details STRING
) USING DELTA
CLUSTER BY (event_date);

-- Change the clustering columns at any time with ALTER TABLE (multiple columns are allowed)
ALTER TABLE dev.bronze.events CLUSTER BY (event_date, event_id);

-- Turn off clustering by setting the columns to NONE
ALTER TABLE dev.bronze.events CLUSTER BY NONE;
```
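To check which clustering columns are currently in effect, one option is DESCRIBE DETAIL, which on versions that support it reports a clusteringColumns field. The sketch below reuses the table from the example above; the filter query and its date literal are illustrative assumptions, and the query only benefits while the table is still clustered by event_date.

```sql
-- Inspect table metadata; the clusteringColumns field lists the current clustering columns
-- (expected to be empty once clustering has been set to NONE).
DESCRIBE DETAIL dev.bronze.events;

-- A query shaped to benefit from clustering on event_date:
-- the filter matches a clustering column, so file-level statistics
-- can be used to skip files that hold other dates.
-- The date literal is an arbitrary illustrative value.
SELECT event_id, details
FROM dev.bronze.events
WHERE event_date = DATE'2024-12-01';
```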

Liquid clustering is incremental: data is rewritten only as needed to accommodate data that still has to be clustered. Data files that were already clustered under different clustering columns are not rewritten. Run OPTIMIZE to trigger clustering:

```sql
OPTIMIZE dev.bronze.events;
```
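Because clustering is incremental, data written before a change of clustering columns keeps its old layout. If you do want everything reclustered, Databricks documents a FULL keyword for OPTIMIZE on Databricks Runtime 16.0 and above. Whether your own runtime or Delta Lake version accepts it is an assumption to verify before relying on this sketch.

```sql
-- Incremental run: clusters only the data that still needs clustering.
OPTIMIZE dev.bronze.events;

-- Full run: forces reclustering of all records, for example after changing clustering columns.
-- Assumes a runtime that supports the FULL keyword; this can be expensive on large tables.
OPTIMIZE dev.bronze.events FULL;
```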

### 🧐 How to Choose Clustering Columns in Delta Lake

Limitations to note:

- Statistics Requirement: Only columns with statistics collected can be specified as clustering columns. By default, statistics are collected for the first 32 columns in a Delta table.

- Clustering Column Limit: You can specify up to 4 clustering columns.

- Feature Flag for Delta Lake 3.1: Enable clustering with the feature flag spark.databricks.delta.clusteredTable.enableClusteringTablePreview.

- Features Not Supported in This Preview:
  - ZCube-based incremental clustering
  - ALTER TABLE ... CLUSTER BY to change clustering columns
  - DESCRIBE DETAIL to inspect current clustering columns

- Delta Lake 3.2 Enhancements: From Delta Lake 3.2 onwards, the preview flag is removed, and the above features are fully supported.
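The flag and the statistics limit above can be sketched in SQL as follows. The value 40 is an arbitrary illustrative number, and the example assumes the delta.dataSkippingNumIndexedCols table property is available in your Delta Lake version. Changing it is generally understood to affect statistics for newly written data rather than to recompute them for existing files, so treat this as a starting point to verify.

```sql
-- Delta Lake 3.1 only: enable the liquid clustering preview for the current session.
-- From Delta Lake 3.2 onwards this flag is no longer needed.
SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview = true;

-- If a desired clustering column falls outside the first 32 columns,
-- raise the number of leading columns for which statistics are collected.
-- 40 is an illustrative value, not a recommendation.
ALTER TABLE dev.bronze.events
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');
```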
