🧊 Snowflake Micro-Partitions & Table Clustering


Source: https://www.linkedin.com/pulse/snowflake-micro-partitions-table-clustering-arabinda-mohapatra-sbgjf?trackingId=xP9nN9ecSTuF1gxQ9dl6jQ%3D%3D



1️⃣ What are Snowflake Micro-Partitions?

  • Snowflake automatically divides tables into micro-partitions
  • You do NOT define partitions manually
  • Micro-partition = smallest unit of storage in Snowflake

Key Characteristics

  • Size: 50–500 MB (uncompressed)
  • Stored in columnar format
  • Automatically compressed
  • Created based on data insertion order (natural clustering)
  • A table can have millions of micro-partitions

What metadata does Snowflake store per micro-partition?

  • Min & Max values for each column
  • Number of distinct values
  • Other optimization statistics

📌 This metadata enables partition pruning (skip unnecessary data during query execution)
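The pruning idea can be sketched in a few lines. This is a hypothetical simulation, not Snowflake internals: each partition keeps min/max stats per column, and a filter skips every partition whose range cannot contain the predicate value.

```python
# Hypothetical illustration of partition pruning via min/max metadata.
# Partition names and ranges are invented for the example.
partitions = [
    {"name": "mp1", "emp_id": (1, 1000)},
    {"name": "mp2", "emp_id": (1001, 2000)},
    {"name": "mp3", "emp_id": (2001, 3000)},
]

def prune(partitions, column, value):
    """Keep only partitions whose [min, max] range could contain `value`."""
    return [p for p in partitions if p[column][0] <= value <= p[column][1]]

# WHERE EMP_ID = 1500 → only mp2 needs to be scanned
survivors = prune(partitions, "emp_id", 1500)
print([p["name"] for p in survivors])  # ['mp2']
```

The point: pruning happens purely on metadata, before any partition is read.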


2️⃣ Why Micro-Partitions are Powerful

Benefits

  • ✅ No manual partition management
  • ✅ Very fine-grained pruning → faster queries
  • ✅ Avoids data skew (small, uniform partitions)
  • ✅ Columnar storage → scans only required columns
  • ✅ Column-level compression (automatic & optimized)


3️⃣ Natural Clustering (Default Behavior)

  • Data is stored in the order it is loaded
  • This is called natural clustering
  • Works well only if query filters match insert order

Example:

  • If data is inserted by EMP_ID
  • Queries filtering on EMP_ID → good performance
  • Queries filtering on other columns → poor pruning


4️⃣ What is Table Clustering?

Definition

Table clustering physically reorganizes data so similar values are stored closer together across micro-partitions.

  • Uses a CLUSTER BY key
  • Improves partition pruning
  • Reduces scan cost
  • Improves query performance


5️⃣ Why Clustering is Needed (Real Example)

Example Table: EMPLOYEE

Concept

  • Data is stored in insertion order
  • Micro-partitions contain mixed values
  • Queries on non-insert columns scan multiple micro-partitions

Columns:

  • EMP_ID
  • DEPT_ID


🔹 Before Clustering (Natural Order: EMP_ID)

| EMP_ID | DEPT_ID |
| ------ | ------- |
| 1      | 1       |
| 2      | 2       |
| 3      | 1       |
| 4      | 3       |
| 5      | 2       |
| 6      | 1       | 

Micro-Partition 1        Micro-Partition 2        Micro-Partition 3
---------------------------------------------------------------
EMP 1 (D1)              EMP 3 (D1)              EMP 5 (D2)
EMP 2 (D2)              EMP 4 (D3)              EMP 6 (D1)


🔴 Query: WHERE DEPT_ID = 1

❌ Problem:

  • DEPT_ID = 1 exists in all 3 micro-partitions
  • Snowflake must scan all 3 micro-partitions



🔹 After Clustering (CLUSTER BY DEPT_ID)

ALTER TABLE EMPLOYEE CLUSTER BY (DEPT_ID);

Visual explanation (after clustering):
Micro-Partition 1        Micro-Partition 2        Micro-Partition 3
---------------------------------------------------------------
EMP 1 (D1)              EMP 2 (D2)              EMP 4 (D3)
EMP 3 (D1)              EMP 5 (D2)
EMP 6 (D1) 

What Happens

  • Data is physically reordered
  • Similar DEPT_ID values are co-located
  • Partition pruning improves

✅ Result:

  • Query scans only 1 micro-partition
  • Faster response
  • Lower compute cost
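The before/after effect above can be simulated in a few lines (a hypothetical sketch using the 6-employee example; each partition is modeled as the set of DEPT_IDs it contains):

```python
# Hypothetical sketch: partitions scanned for WHERE DEPT_ID = 1,
# before and after clustering on DEPT_ID.

def partitions_scanned(partitions, dept):
    # A partition must be scanned if any of its rows has the target DEPT_ID.
    return sum(1 for p in partitions if dept in p)

# Natural (insertion) order: DEPT_ID 1 is scattered across all three partitions.
before = [{1, 2}, {1, 3}, {2, 1}]
# After CLUSTER BY (DEPT_ID): each partition holds a single department.
after = [{1}, {2}, {3}]

print(partitions_scanned(before, 1))  # 3
print(partitions_scanned(after, 1))   # 1
```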


6️⃣ What is Clustering Depth?

Definition

Clustering Depth measures data overlap across micro-partitions for a given column or clustering key.

  • Lower depth = better clustering
  • Higher depth = more overlap → more scanning

Simple Understanding

| Scenario           | Clustering Depth |
| ------------------ | ---------------- |
| Empty table        | 0                |
| Perfect clustering | ~1               |
| Heavy overlap      | High value       |

📌 Snowflake tries to keep depth low, but DML operations disturb it over time.
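A rough way to build intuition for depth (this is a simplified approximation, not Snowflake's exact algorithm): model each micro-partition as the [min, max] range of its clustering key, and count how many partitions overlap each one.

```python
# Simplified depth estimate: average number of partitions whose key ranges
# overlap each partition (every partition overlaps itself, so the floor is 1).

def avg_depth(ranges):
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    depths = [sum(overlaps(r, other) for other in ranges) for r in ranges]
    return sum(depths) / len(ranges)

well_clustered = [(1, 10), (11, 20), (21, 30)]   # disjoint key ranges
badly_clustered = [(1, 30), (5, 25), (10, 30)]   # heavily overlapping ranges

print(avg_depth(well_clustered))   # 1.0 → perfect clustering
print(avg_depth(badly_clustered))  # 3.0 → every partition overlaps every other
```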


❌ High vs ✅ Low Clustering Depth

  • Bad clustering (high clustering depth): DEPT_ID = 1 is scattered across multiple micro-partitions. When you query DEPT_ID = 1, Snowflake must scan many partitions, so performance is slower.
  • Good clustering (low clustering depth): DEPT_ID = 1 is stored together in a single micro-partition. The same query scans only one partition, giving faster results and lower cost.


7️⃣ How to Check Clustering Health

Snowflake provides system functions:

Average Clustering Depth

SELECT SYSTEM$CLUSTERING_DEPTH('EMPLOYEE', 'DEPT_ID'); 

Example output: 4.28

Meaning

  • On average, each DEPT_ID value spans ~4 micro-partitions
  • Lower is better
  • Ideal value: close to 1 (a perfectly clustered key)

Detailed Clustering Info

SELECT SYSTEM$CLUSTERING_INFORMATION('EMPLOYEE', '(DEPT_ID)'); 


{
  "cluster_by_keys" : "DEPT_ID",
  "total_partition_count" : 73,
  "total_constant_partition_count" : 1,
  "average_overlaps" : 33.7808,
  "average_depth" : 23.3836,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 1,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0,
    "00032" : 72
  },
  "clustering_errors" : [ ]
}



"cluster_by_keys" : "DEPT_ID"

  • The table is clustered using DEPT_ID
  • Snowflake tries to physically group rows with the same DEPT_ID
  • All pruning and depth calculations are based on this column

"total_partition_count" : 73

  • The EMPLOYEE table is split into 73 micro-partitions
  • This is the total storage blocks Snowflake scans from
  • More partitions = more opportunity for pruning (if clustering is good)

"average_overlaps" : 33.7808

  • On average, each DEPT_ID value appears in ~34 micro-partitions
  • This is a red flag 🚩
  • High overlap means rows with the same DEPT_ID are spread across many micro-partitions, so pruning skips very little

"average_depth" : 23.3836

  • This is the Clustering Depth
  • Meaning: a filter on the cluster key still touches many micro-partitions

SELECT * FROM EMPLOYEE WHERE DEPT_ID = 10; -- Snowflake scans ~23 micro-partitions on average

This table will benefit from reclustering or a cluster-key redesign. Note that manual reclustering via ALTER TABLE ... RECLUSTER is deprecated; reclustering is handled by the Automatic Clustering service, which can be re-enabled if it was suspended:

ALTER TABLE EMPLOYEE RESUME RECLUSTER;

Healthy table ✅

  • Low clustering depth
  • Low overlaps
  • Queries scan fewer MPs
  • Lower warehouse cost

Unhealthy table ❌

  • Depth > 3
  • Frequent full-table scans
  • High query cost
  • Needs reclustering or a better cluster key

One-line mental model (easy to remember)

Good clustering = same DEPT_ID stays together
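The health checks above can be wrapped into a small helper that interprets the SYSTEM$CLUSTERING_INFORMATION JSON. This is a hypothetical Python sketch; the depth threshold of 3 follows this article's rule of thumb, not an official Snowflake cutoff.

```python
import json

def clustering_health(info_json, depth_threshold=3):
    """Flag a table as healthy or needing attention from its clustering info JSON."""
    info = json.loads(info_json)
    depth = info["average_depth"]
    status = "healthy" if depth <= depth_threshold else "needs attention"
    return {"key": info["cluster_by_keys"], "avg_depth": depth, "status": status}

# Trimmed-down version of the EMPLOYEE output shown earlier
sample = ('{"cluster_by_keys": "DEPT_ID", "total_partition_count": 73, '
          '"average_overlaps": 33.7808, "average_depth": 23.3836}')
print(clustering_health(sample))
# {'key': 'DEPT_ID', 'avg_depth': 23.3836, 'status': 'needs attention'}
```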

8️⃣ How to Choose a Good Cluster Key

Best Practices

✅ Choose columns:

  • Frequently used in WHERE filters
  • Used in JOIN conditions
  • With medium cardinality

❌ Avoid:

  • Very low cardinality (e.g. gender)
  • Very high cardinality (e.g. nanosecond timestamps)


Multi-Column Cluster Key Rules

  • Max 3–4 columns
  • Order columns from lowest to highest cardinality

Example:

CLUSTER BY (DEPT_ID, JOINING_DATE) 
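The cardinality rule above can be screened programmatically. A hypothetical sketch: compare distinct values to total rows; the ratio bounds below are illustrative cutoffs, not Snowflake-prescribed values.

```python
# Hypothetical screen for candidate cluster keys by cardinality ratio
# (distinct values / total rows). Bounds are illustrative only.

def key_candidate_quality(values, low=0.001, high=0.5):
    ratio = len(set(values)) / len(values)
    if ratio < low:
        return "too low cardinality (e.g. gender)"
    if ratio > high:
        return "too high cardinality (e.g. timestamps)"
    return "reasonable candidate"

# 50 departments spread over 10,000 rows → medium cardinality
dept_ids = [d for d in range(50) for _ in range(200)]
print(key_candidate_quality(dept_ids))  # reasonable candidate
```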

9️⃣ Manual vs Automatic Clustering

  • Snowflake clustering is fully automatic
  • Runs as a background process
  • Compute tracked under the AUTOMATIC_CLUSTERING serverless service (see the AUTOMATIC_CLUSTERING_HISTORY view)

You only:

  • Define the cluster key
  • Monitor cost vs performance


🔟 What Triggers Re-Clustering?

  • INSERT
  • UPDATE
  • DELETE
  • MERGE
  • COPY

  • Snowflake decides when to recluster using internal logic
  • No fixed schedule
  • No query downtime during reclustering


Cost of Table Clustering

1️⃣ Compute Cost

  • Credits consumed by AUTOMATIC_CLUSTERING warehouse
  • Charged based on actual usage

2️⃣ Storage Cost

  • Reclustering creates new micro-partitions
  • Old partitions remain until cleanup
  • More frequent DML → higher storage usage


Final Words:

  • Micro-partitioning is automatic & core to Snowflake
  • Clustering improves query performance via pruning
  • Use clustering only when queries slow down
  • Always balance performance gains vs cost
  • Query performance is the best indicator of good clustering

Reference: https://community.snowflake.com/s/article/understanding-micro-partitions-and-data-clustering



