Unlocking Performance: Best Practices for Amazon Redshift Table Design
Unlocking Performance: Best Practices for Amazon Redshift Table Design
Published: None
Unlocking Performance: Best Practices for Amazon Redshift Table Design
1. Nodes and Slices:
- An Amazon Redshift cluster is a set of nodes. Each node in the cluster has its own operating system, dedicated memory, and dedicated disk storage. One node is the leader node, which manages the distribution of data and query processing tasks to the compute nodes. The compute nodes provide resources to do those tasks.
- The leader node manages data distribution and query processing tasks & Compute nodes execute these tasks.
- Disk storage on a compute node is divided into slices.
- The number of slices per node varies based on the node size.
- Nodes work in parallel, distributing data evenly across slices for efficient query processing.
Designing Amazon Redshift
Data Redistribution:
- When loading data into a table, Amazon Redshift distributes rows to node slices based on the table’s distribution style.
- During query execution, the optimizer decides the optimal data location, leading to data redistribution.
- Redistribution involves moving specific rows to nodes for joins or broadcasting entire tables to all nodes.
- This process can significantly impact query cost and generate network traffic, affecting overall system performance.
- Minimizing data redistribution by strategically locating data initially can enhance performance.
Data Distribution Goals:
1. Uniform Workload Distribution:
- Aim to distribute the workload evenly among cluster nodes.
- Avoid data distribution skew, which causes some nodes to work harder, impairing query performance.
2. Minimize Data Movement:
- Ensure rows involved in joins or aggregates are collocated on the same nodes.
- This reduces the need for data redistribution during query execution, optimizing performance.
Distribution styles
- When you create a table, you can designate one of the following distribution styles: AUTO, EVEN, KEY, or ALL.
- If you don't specify a distribution style, Amazon Redshift uses AUTO distribution.
Redshift distribution styles: Auto
- With AUTO distribution, Amazon Redshift optimally assigns a distribution style based on table size: starting with ALL for small tables, switching to KEY as the table grows (using the primary key or a composite key column), and eventually changing to EVEN if no suitable distribution key is found. These adjustments occur in the background with minimal impact on user queries.
Redshift distribution styles: All
- ‘All’ is the simplest distribution style. If you set a distribution style of ‘All’, you instruct Redshift to simply make a copy of the table to every node in the cluster.
- In Amazon Redshift, the ALL distribution style ensures that a copy of the entire table is distributed to every node in the cluster.
- This approach is beneficial for joins, as every row is collocated. However, it significantly increases storage requirements and the time needed for data operations.
- Therefore, it’s best suited for static tables that aren’t frequently updated.
- For small dimension tables, the benefit of using ALL distribution is minimal due to the low cost of redistributing them during queries.
Redshift distribution styles: Even
- ‘Even’ – Specifying ‘Even’ distribution spreads the table rows over all the nodes in the cluster, well, evenly!
- The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
- EVEN distribution is appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
Redshift distribution styles: Key
- ‘Key’ – With a key distribution set, you specify a column to distribute on and then, cleverly, AWS Redshift ensures that all the rows with the same value of that key are placed on the same node.
- The rows are distributed according to the values in one column.
- The leader node places matching values on the same node slice.
- If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns.
- This way, matching values from the common columns are physically stored together.
Distribution strategy:
- ALL Distribution: Used for smaller dimension tables (e.g., a date dimension with a few thousand entries).
- EVEN Distribution: Applied to tables not joined with others or only joined to tables with ALL distribution (e.g., a fact table joined to small dimensions).
- KEY Distribution: For very large dimensions, both the dimension and associated fact tables are distributed on their join column. If there’s a second large dimension, you either:Take the storage hit and distribute ALL.Design the dimension columns into the fact table.
Thanks for reading
Reference:
Comments
Post a Comment