🧊 Snowflake Micro-Partitions & Table Clustering
1️⃣ What are Snowflake Micro-Partitions?
- Snowflake automatically divides tables into micro-partitions
- You do NOT define partitions manually
- Micro-partition = smallest unit of storage in Snowflake
Key Characteristics
- Size: 50–500 MB (uncompressed)
- Stored in columnar format
- Automatically compressed
- Created based on data insertion order (natural clustering)
- A table can have millions of micro-partitions
What metadata does Snowflake store per micro-partition?
- Min & Max values for each column
- Number of distinct values
- Other optimization statistics
📌 This metadata enables partition pruning (skip unnecessary data during query execution)
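As a toy illustration (not Snowflake's internal implementation), min/max pruning can be sketched like this; the partition ranges and helper function below are hypothetical:

```python
# Hypothetical per-partition metadata: min and max of the EMP_ID column.
partitions = [
    {"name": "mp1", "emp_id_min": 1,   "emp_id_max": 100},
    {"name": "mp2", "emp_id_min": 101, "emp_id_max": 200},
    {"name": "mp3", "emp_id_min": 201, "emp_id_max": 300},
]

def prune(parts, value):
    """Keep only partitions whose [min, max] range could contain `value`."""
    return [p["name"] for p in parts
            if p["emp_id_min"] <= value <= p["emp_id_max"]]

# WHERE EMP_ID = 150 -> only mp2 can contain the row; mp1 and mp3 are skipped
# without reading any data, purely from metadata.
print(prune(partitions, 150))  # ['mp2']
```

This is the essence of pruning: the query engine consults metadata first and touches only the partitions whose ranges overlap the filter.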
2️⃣ Why Micro-Partitions are Powerful
Benefits
- ✅ No manual partition management
- ✅ Very fine-grained pruning → faster queries
- ✅ Avoids data skew (small, uniform partitions)
- ✅ Columnar storage → scans only required columns
- ✅ Column-level compression (automatic & optimized)
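The columnar-storage benefit can be shown with a toy contrast (pure illustration, not Snowflake's storage engine): a filter on one column reads only that column's values.

```python
# Row layout: every row carries all columns, so a scan touches everything.
rows = [(1, 1), (2, 2), (3, 1)]          # (EMP_ID, DEPT_ID)

# Columnar layout: each column is stored contiguously (and compressed on its
# own in a real engine).
columns = {"EMP_ID": [1, 2, 3], "DEPT_ID": [1, 2, 1]}

# SELECT COUNT(*) WHERE DEPT_ID = 1 needs only the DEPT_ID column;
# EMP_ID is never read.
matches = sum(1 for d in columns["DEPT_ID"] if d == 1)
print(matches)  # 2
```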
3️⃣ Natural Clustering (Default Behavior)
- Data is stored in the order it is loaded
- This is called natural clustering
- Works well only if query filters match insert order
Example:
- If data is inserted by EMP_ID
- Queries filtering on EMP_ID → good performance
- Queries filtering on other columns → poor pruning
4️⃣ What is Table Clustering?
Definition
Table clustering physically reorganizes data so similar values are stored closer together across micro-partitions.
- Uses a CLUSTER BY key
- Improves partition pruning
- Reduces scan cost
- Improves query performance
5️⃣ Why Clustering is Needed (Real Example)
Example Table: EMPLOYEE
Concept
- Data is stored in insertion order
- Micro-partitions contain mixed values
- Queries on non-insert columns scan multiple micro-partitions
Columns:
- EMP_ID
- DEPT_ID
🔹 Before Clustering (Natural Order: by EMP_ID)
| EMP_ID | DEPT_ID |
| ------ | ------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 3 |
| 5 | 2 |
| 6 | 1 |
Micro-Partition 1      Micro-Partition 2      Micro-Partition 3
---------------------------------------------------------------
EMP 1 (D1)             EMP 3 (D1)             EMP 5 (D2)
EMP 2 (D2)             EMP 4 (D3)             EMP 6 (D1)
🔴 Query: WHERE DEPT_ID = 1
❌ Problem:
- DEPT_ID = 1 exists in all 3 micro-partitions
- Snowflake must scan all 3 partitions
After Clustering (CLUSTER BY DEPT_ID)
ALTER TABLE EMPLOYEE CLUSTER BY (DEPT_ID);

Micro-Partition 1      Micro-Partition 2      Micro-Partition 3
---------------------------------------------------------------
EMP 1 (D1)             EMP 2 (D2)             EMP 4 (D3)
EMP 3 (D1)             EMP 5 (D2)
EMP 6 (D1)
What Happens
- Data is physically reordered
- Similar DEPT_ID values are co-located
- Partition pruning improves
✅ Result:
- Query scans only 1 micro-partition
- Faster response
- Lower compute cost
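The before/after pruning effect can be sketched with a small simulation; the partition layouts below mirror the EMPLOYEE example above and are illustrative only, not Snowflake internals:

```python
def partitions_scanned(layout, dept_id):
    """A partition must be scanned if dept_id falls inside its min/max range."""
    return sum(1 for part in layout
               if min(part) <= dept_id <= max(part))

# Before clustering: insertion order -> DEPT_ID values are mixed everywhere.
before = [[1, 2], [1, 3], [2, 1]]        # DEPT_ID values per micro-partition

# After CLUSTER BY (DEPT_ID): equal values are co-located.
after = [[1, 1, 1], [2, 2], [3]]

print(partitions_scanned(before, 1))  # 3 -> every partition's range covers 1
print(partitions_scanned(after, 1))   # 1 -> pruning skips the other two
```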
6️⃣ What is Clustering Depth?
Definition
Clustering Depth measures data overlap across micro-partitions for a given column or clustering key.
- Lower depth = better clustering
- Higher depth = more overlap → more scanning
Simple Understanding
| Scenario | Clustering Depth |
| -------- | ---------------- |
| Empty table | 0 |
| Perfect clustering | ~1 |
| Heavy overlap | High value |
📌 Snowflake tries to keep depth low, but DML operations disturb it over time.
❌ High Clustering Depth (Bad)
- DEPT_ID = 1 is scattered across multiple micro-partitions
- A query on DEPT_ID = 1 must scan many partitions, so performance is slower
✅ Low Clustering Depth (Good)
- DEPT_ID = 1 is stored together in a single micro-partition
- The same query scans only one partition, giving faster results and lower cost
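Clustering depth can be approximated in a toy model: for each key value, count how many partitions' min/max ranges cover it, then average. This is an illustration of the idea only, not Snowflake's exact algorithm (on real tables, use SYSTEM$CLUSTERING_DEPTH):

```python
def avg_depth(layout, values):
    """Average number of partitions whose min/max range covers each value."""
    depths = [sum(1 for part in layout if min(part) <= v <= max(part))
              for v in values]
    return sum(depths) / len(depths)

bad  = [[1, 3], [1, 2], [2, 3]]   # ranges overlap -> values span partitions
good = [[1, 1], [2, 2], [3, 3]]   # each value lives in exactly one partition

print(avg_depth(bad,  [1, 2, 3]))  # > 2 -> heavy overlap, poor pruning
print(avg_depth(good, [1, 2, 3]))  # 1.0 -> perfectly clustered
```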
7️⃣ How to Check Clustering Health
Snowflake provides system functions:
Average Clustering Depth
SELECT SYSTEM$CLUSTERING_DEPTH('EMPLOYEE', 'DEPT_ID');
Example output: 4.28
Meaning
- On average, each DEPT_ID value spans ~4 micro-partitions
- Lower is better
- Ideal values: close to 1 (0 only for an empty table)
Detailed Clustering Info
SELECT SYSTEM$CLUSTERING_INFORMATION('EMPLOYEE', '(DEPT_ID)');
{
"cluster_by_keys" : "DEPT_ID",
"total_partition_count" : 73,
"total_constant_partition_count" : 1,
"average_overlaps" : 33.7808,
"average_depth" : 23.3836,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"00032" : 72
},
"clustering_errors" : [ ]
}
"cluster_by_keys" : "DEPT_ID"
- The table is clustered using DEPT_ID
- Snowflake tries to physically group rows with the same DEPT_ID
- All pruning and depth calculations are based on this column
"total_partition_count" : 73
- The EMPLOYEE table is split into 73 micro-partitions
- This is the total storage blocks Snowflake scans from
- More partitions = more opportunity for pruning (if clustering is good)
"average_overlaps" : 33.7808
- On average, each micro-partition's DEPT_ID range overlaps with ~34 other micro-partitions
- This is a red flag 🚩
- High overlap means the same DEPT_ID values are spread across many partitions, so pruning cannot skip them
"average_depth" : 23.3836
- This is the Clustering Depth
- Meaning: a filter such as WHERE DEPT_ID = 10 forces Snowflake to scan ~23 micro-partitions on average
This table will benefit from reclustering or a redesigned cluster key:
ALTER TABLE EMPLOYEE CLUSTER BY (DEPT_ID);
(Note: the older manual ALTER TABLE ... RECLUSTER command is deprecated; once a cluster key is defined, the Automatic Clustering service reclusters in the background.)
Healthy table ✅
- Low clustering depth
- Low overlaps
- Queries scan fewer MPs
- Lower warehouse cost
Unhealthy table ❌
- Depth > 3
- Frequent full-table scans
- High query cost
- Needs: reclustering or a better cluster key
One-line mental model (easy to remember)
Good clustering = same DEPT_ID stays together
8️⃣ How to Choose a Good Cluster Key
Best Practices
✅ Choose columns:
- Frequently used in WHERE filters
- Used in JOIN conditions
- With medium cardinality
❌ Avoid:
- Very low cardinality (e.g. gender)
- Very high cardinality (e.g. nanosecond timestamps)
Multi-Column Cluster Key Rules
- Max 3–4 columns
- Order columns from lowest to highest cardinality
Example:
CLUSTER BY (DEPT_ID, JOINING_DATE)
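Why "reduce cardinality" matters can be shown with a toy example: truncating a high-cardinality timestamp to a date (in Snowflake, clustering on an expression such as TO_DATE(ts)) yields a medium-cardinality key that groups related rows. The timestamps below are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical event timestamps, one every 977 seconds.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(seconds=i * 977) for i in range(10_000)]

# Raw timestamps: every value is distinct -> far too high cardinality
# to be a useful cluster key.
raw_cardinality = len(set(timestamps))

# Truncated to the date: a medium-cardinality key (~one value per day).
date_cardinality = len(set(t.date() for t in timestamps))

print(raw_cardinality, date_cardinality)  # 10000 vs ~114
```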
9️⃣ Manual vs Automatic Clustering
- Snowflake clustering is fully automatic
- Runs as a background process
- Compute tracked under the serverless AUTOMATIC_CLUSTERING service (see the AUTOMATIC_CLUSTERING_HISTORY view/function)
You only:
- Define the cluster key
- Monitor cost vs performance
🔟 What Triggers Re-Clustering?
- INSERT
- UPDATE
- DELETE
- MERGE
- COPY
- Snowflake decides when to recluster using internal logic
- No fixed schedule
- No query downtime during reclustering
Cost of Table Clustering
1️⃣ Compute Cost
- Credits consumed by the serverless Automatic Clustering service (billed as AUTOMATIC_CLUSTERING)
- Charged based on actual usage
2️⃣ Storage Cost
- Reclustering creates new micro-partitions
- Old partitions remain until cleanup (retained for Time Travel / Fail-safe)
- More frequent DML → higher storage usage
Final Words:
- Micro-partitioning is automatic & core to Snowflake
- Clustering improves query performance via pruning
- Use clustering only when queries slow down
- Always balance performance gains vs cost
- Query performance is the best indicator of good clustering
Reference: https://community.snowflake.com/s/article/understanding-micro-partitions-and-data-clustering