🧊 Snowflake Micro-Partitions & Table Clustering
1️⃣ What are Snowflake Micro-Partitions?
- Snowflake automatically divides tables into micro-partitions
- You do NOT define partitions manually
- Micro-partition = smallest unit of storage in Snowflake
Key Characteristics
- Size: 50–500 MB (uncompressed)
- Stored in columnar format
- Automatically compressed
- Created based on data insertion order (natural clustering)
- A table can have millions of micro-partitions
What metadata does Snowflake store per micro-partition?
- Min & Max values for each column
- Number of distinct values
- Other optimization statistics
📌 This metadata enables partition pruning (skip unnecessary data during query execution)
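As a toy illustration (not Snowflake's internal implementation), min/max pruning can be sketched like this; the partition ranges and helper function below are hypothetical:

```python
# Hypothetical per-partition metadata: min and max of the EMP_ID column.
partitions = [
    {"name": "mp1", "emp_id_min": 1,   "emp_id_max": 100},
    {"name": "mp2", "emp_id_min": 101, "emp_id_max": 200},
    {"name": "mp3", "emp_id_min": 201, "emp_id_max": 300},
]

def prune(parts, value):
    """Keep only partitions whose [min, max] range could contain `value`."""
    return [p["name"] for p in parts
            if p["emp_id_min"] <= value <= p["emp_id_max"]]

# WHERE EMP_ID = 150 -> only mp2 can contain the row; mp1 and mp3 are skipped
# without reading any data, purely from metadata.
print(prune(partitions, 150))  # ['mp2']
```

This is the essence of pruning: the query engine consults metadata first and touches only the partitions whose ranges overlap the filter.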
2️⃣ Why Micro-Partitions are Powerful
Benefits
- ✅ No manual partition management
- ✅ Very fine-grained pruning → faster queries
- ✅ Avoids data skew (small, uniform partitions)
- ✅ Columnar storage → scans only required columns
- ✅ Column-level compression (automatic & optimized)
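The columnar-storage benefit can be shown with a toy contrast (pure illustration, not Snowflake's storage engine): a filter on one column reads only that column's values.

```python
# Row layout: every row carries all columns, so a scan touches everything.
rows = [(1, 1), (2, 2), (3, 1)]          # (EMP_ID, DEPT_ID)

# Columnar layout: each column is stored contiguously (and compressed on its
# own in a real engine).
columns = {"EMP_ID": [1, 2, 3], "DEPT_ID": [1, 2, 1]}

# SELECT COUNT(*) WHERE DEPT_ID = 1 needs only the DEPT_ID column;
# EMP_ID is never read.
matches = sum(1 for d in columns["DEPT_ID"] if d == 1)
print(matches)  # 2
```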
3️⃣ Natural Clustering (Default Behavior)
- Data is stored in the order it is loaded
- This is called natural clustering
- Works well only if query filters match insert order
Example:
- If data is inserted by EMP_ID
- Queries filtering on EMP_ID → good performance
- Queries filtering on other columns → poor pruning
4️⃣ What is Table Clustering?
Definition
Table clustering physically reorganizes data so similar values are stored closer together across micro-partitions.
- Uses a CLUSTER BY key
- Improves partition pruning
- Reduces scan cost
- Improves query performance
5️⃣ Why Clustering is Needed (Real Example)
Example Table: EMPLOYEE
Concept
- Data is stored in insertion order
- Micro-partitions contain mixed values
- Queries on non-insert columns scan multiple micro-partitions
Columns:
- EMP_ID
- DEPT_ID
🔹 Before Clustering (Natural Order: by EMP_ID)
| EMP_ID | DEPT_ID |
| ------ | ------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 3 |
| 5 | 2 |
| 6 | 1 |
Micro-Partition 1      Micro-Partition 2      Micro-Partition 3
---------------------------------------------------------------
EMP 1 (D1)             EMP 3 (D1)             EMP 5 (D2)
EMP 2 (D2)             EMP 4 (D3)             EMP 6 (D1)
🔴 Query: WHERE DEPT_ID = 1
❌ Problem:
- DEPT_ID = 1 exists in all 3 micro-partitions
- Snowflake must scan all 3 partitions
After Clustering (CLUSTER BY DEPT_ID)
ALTER TABLE EMPLOYEE CLUSTER BY (DEPT_ID);

Micro-Partition 1      Micro-Partition 2      Micro-Partition 3
---------------------------------------------------------------
EMP 1 (D1)             EMP 2 (D2)             EMP 4 (D3)
EMP 3 (D1)             EMP 5 (D2)
EMP 6 (D1)
What Happens
- Data is physically reordered
- Similar DEPT_ID values are co-located
- Partition pruning improves
✅ Result:
- Query scans only 1 micro-partition
- Faster response
- Lower compute cost
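The before/after pruning effect can be sketched with a small simulation; the partition layouts below mirror the EMPLOYEE example above and are illustrative only, not Snowflake internals:

```python
def partitions_scanned(layout, dept_id):
    """A partition must be scanned if dept_id falls inside its min/max range."""
    return sum(1 for part in layout
               if min(part) <= dept_id <= max(part))

# Before clustering: insertion order -> DEPT_ID values are mixed everywhere.
before = [[1, 2], [1, 3], [2, 1]]        # DEPT_ID values per micro-partition

# After CLUSTER BY (DEPT_ID): equal values are co-located.
after = [[1, 1, 1], [2, 2], [3]]

print(partitions_scanned(before, 1))  # 3 -> every partition's range covers 1
print(partitions_scanned(after, 1))   # 1 -> pruning skips the other two
```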
6️⃣ What is Clustering Depth?
Definition
Clustering Depth measures data overlap across micro-partitions for a given column or clustering key.
- Lower depth = better clustering
- Higher depth = more overlap → more scanning
Simple Understanding
| Scenario | Clustering Depth |
| -------- | ---------------- |
| Empty table | 0 |
| Perfect clustering | ~1 |
| Heavy overlap | High value |
📌 Snowflake tries to keep depth low, but DML operations disturb it over time.
❌ High Clustering Depth (Bad)
- DEPT_ID = 1 is scattered across multiple micro-partitions
- A query on DEPT_ID = 1 must scan many partitions, so performance is slower
✅ Low Clustering Depth (Good)
- DEPT_ID = 1 is stored together in a single micro-partition
- The same query scans only one partition, giving faster results and lower cost
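Clustering depth can be approximated in a toy model: for each key value, count how many partitions' min/max ranges cover it, then average. This is an illustration of the idea only, not Snowflake's exact algorithm (on real tables, use SYSTEM$CLUSTERING_DEPTH):

```python
def avg_depth(layout, values):
    """Average number of partitions whose min/max range covers each value."""
    depths = [sum(1 for part in layout if min(part) <= v <= max(part))
              for v in values]
    return sum(depths) / len(depths)

bad  = [[1, 3], [1, 2], [2, 3]]   # ranges overlap -> values span partitions
good = [[1, 1], [2, 2], [3, 3]]   # each value lives in exactly one partition

print(avg_depth(bad,  [1, 2, 3]))  # > 2 -> heavy overlap, poor pruning
print(avg_depth(good, [1, 2, 3]))  # 1.0 -> perfectly clustered
```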
7️⃣ How to Check Clustering Health
Snowflake provides system functions:
Average Clustering Depth
SELECT SYSTEM$CLUSTERING_DEPTH('EMPLOYEE', 'DEPT_ID');
Example output: 4.28
Meaning
- On average, each DEPT_ID value spans ~4 micro-partitions
- Lower is better
- Ideal values: close to 1 (0 only for an empty table)
Detailed Clustering Info
SELECT SYSTEM$CLUSTERING_INFORMATION('EMPLOYEE', '(DEPT_ID)');
{
"cluster_by_keys" : "DEPT_ID",
"total_partition_count" : 73,
"total_constant_partition_count" : 1,
"average_overlaps" : 33.7808,
"average_depth" : 23.3836,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"00032" : 72
},
"clustering_errors" : [ ]
}
"cluster_by_keys" : "DEPT_ID"
- The table is clustered using DEPT_ID
- Snowflake tries to physically group rows with the same DEPT_ID
- All pruning and depth calculations are based on this column
"total_partition_count" : 73
- The EMPLOYEE table is split into 73 micro-partitions
- This is the total storage blocks Snowflake scans from
- More partitions = more opportunity for pruning (if clustering is good)
"average_overlaps" : 33.7808
- On average, each micro-partition's DEPT_ID range overlaps with ~34 other micro-partitions
- This is a red flag 🚩
- High overlap means the same DEPT_ID values are spread across many partitions, so pruning cannot skip them
"average_depth" : 23.3836
- This is the Clustering Depth
- Meaning: a filter such as WHERE DEPT_ID = 10 forces Snowflake to scan ~23 micro-partitions on average
This table will benefit from reclustering or a redesigned cluster key:
ALTER TABLE EMPLOYEE CLUSTER BY (DEPT_ID);
(Note: the older manual ALTER TABLE ... RECLUSTER command is deprecated; once a cluster key is defined, the Automatic Clustering service reclusters in the background.)
Healthy table ✅
- Low clustering depth
- Low overlaps
- Queries scan fewer MPs
- Lower warehouse cost
Unhealthy table ❌
- Depth > 3
- Frequent full-table scans
- High query cost
- Needs: reclustering or a better cluster key
One-line mental model (easy to remember)
Good clustering = same DEPT_ID stays together
8️⃣ How to Choose a Good Cluster Key
Best Practices
✅ Choose columns:
- Frequently used in WHERE filters
- Used in JOIN conditions
- With medium cardinality
❌ Avoid:
- Very low cardinality (e.g. gender)
- Very high cardinality (e.g. nanosecond timestamps)
Multi-Column Cluster Key Rules
- Max 3–4 columns
- Order columns from lowest to highest cardinality
Example:
CLUSTER BY (DEPT_ID, JOINING_DATE)
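Why "reduce cardinality" matters can be shown with a toy example: truncating a high-cardinality timestamp to a date (in Snowflake, clustering on an expression such as TO_DATE(ts)) yields a medium-cardinality key that groups related rows. The timestamps below are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical event timestamps, one every 977 seconds.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(seconds=i * 977) for i in range(10_000)]

# Raw timestamps: every value is distinct -> far too high cardinality
# to be a useful cluster key.
raw_cardinality = len(set(timestamps))

# Truncated to the date: a medium-cardinality key (~one value per day).
date_cardinality = len(set(t.date() for t in timestamps))

print(raw_cardinality, date_cardinality)  # 10000 vs ~114
```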
9️⃣ Manual vs Automatic Clustering
- Snowflake clustering is fully automatic
- Runs as a background process
- Compute tracked under the serverless AUTOMATIC_CLUSTERING service (see the AUTOMATIC_CLUSTERING_HISTORY view/function)
You only:
- Define the cluster key
- Monitor cost vs performance
🔟 What Triggers Re-Clustering?
- INSERT
- UPDATE
- DELETE
- MERGE
- COPY
- Snowflake decides when to recluster using internal logic
- No fixed schedule
- No query downtime during reclustering
Cost of Table Clustering
1️⃣ Compute Cost
- Credits consumed by the serverless Automatic Clustering service (billed as AUTOMATIC_CLUSTERING)
- Charged based on actual usage
2️⃣ Storage Cost
- Reclustering creates new micro-partitions
- Old partitions remain until cleanup (retained for Time Travel / Fail-safe)
- More frequent DML → higher storage usage
Final Words:
- Micro-partitioning is automatic & core to Snowflake
- Clustering improves query performance via pruning
- Use clustering only when queries slow down
- Always balance performance gains vs cost
- Query performance is the best indicator of good clustering
Reference: https://community.snowflake.com/s/article/understanding-micro-partitions-and-data-clustering