Partitioning and Bucketing in Apache Spark
- Partitioning and bucketing are two powerful techniques in Apache Spark that help optimize data processing and query performance. Here’s a detailed look at both methods and when to use them.
Partitioning in Spark
- Partitioning splits data into separate folders on disk based on one or more columns. This enables efficient parallelism and partition pruning, which optimizes queries by skipping unnecessary data.
Implementation:
- Partitioning is done using the .partitionBy() method of the DataFrameWriter class. You specify the columns to partition by, and Spark writes the rows for each distinct value into a separate folder on disk. The number of files inside each folder depends on how the DataFrame is partitioned in memory at write time (for example, the number of write tasks, or spark.sql.shuffle.partitions if a shuffle happens before the write).
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()
# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")
# Partition the dataset by the "date" column
df.write.partitionBy("date").format("parquet").save("path/to/partitioned/dataset")
In this example, the dataset is partitioned by the “date” column and saved in Parquet format, with one sub-folder per distinct date value.
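To see the benefit of partition pruning, the partitioned data can be read back with a filter on the partition column; the date value below is just an illustrative placeholder. A minimal sketch:
# Read the partitioned dataset back and filter on the partition column;
# Spark only scans the date=2024-01-01 folder instead of the whole dataset
pruned_df = (
    spark.read.format("parquet")
    .load("path/to/partitioned/dataset")
    .filter("date = '2024-01-01'")
)
pruned_df.explain()  # the physical plan lists PartitionFilters on "date"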
Bucketing in Spark
- Bucketing assigns rows to specific buckets and collocates them on disk, which is useful for wide transformations like joins and aggregations. Bucketing reduces the need for shuffling data across partitions.
Implementation:
- Bucketing is done using the .bucketBy() method of the DataFrameWriter class. You specify the number of buckets and the column to bucket by, and each row's bucket is determined by a hash of the bucket column. Unlike partitioning, bucketing only works with table-based writes, so the DataFrame must be written with saveAsTable() rather than save().
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("BucketingExample").getOrCreate()
# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")
# Bucket the dataset by the "id" column into 10 buckets and write it as a table
# (bucketBy only works with saveAsTable; a plain save() is not supported)
df.write.bucketBy(10, "id").sortBy("id").format("parquet").saveAsTable("bucketed_dataset")
A rule of thumb for choosing the number of buckets:
- Number of buckets = total dataset size / default block size
- Default block size = 128 MB
- Total dataset size ≈ number of records × number of columns × average size per value (based on the column data types)
For example, 100 million records × 10 columns × 8 bytes ≈ 8 GB, and 8 GB / 128 MB ≈ roughly 64 buckets.
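To see why this matters for joins, the sketch below (using a hypothetical second dataset and table name) buckets another DataFrame the same way and joins the two tables on the bucketing column; with matching bucket counts, Spark can join bucket-to-bucket without shuffling either side:
# Bucket a second (hypothetical) dataset the same way on "id"
other_df = spark.read.format("csv").option("header", "true").load("path/to/other/dataset")
other_df.write.bucketBy(10, "id").sortBy("id").format("parquet").saveAsTable("other_bucketed_dataset")
# Read both bucketed tables back and join on the bucketing column
left = spark.table("bucketed_dataset")
right = spark.table("other_bucketed_dataset")
joined = left.join(right, "id")
joined.explain()  # with matching buckets, no Exchange (shuffle) appears before the join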
When to Use Partitioning and Bucketing
- Partitioning: Use partitioning when you frequently filter on a column with low cardinality. This helps in skipping unnecessary data and speeds up query performance.
- Bucketing: Use bucketing for complex operations like joins, groupBys, and windowing on columns with high cardinality. Bucketing helps in reducing shuffling and sorting costs.
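As a small illustration of the second point (reusing the hypothetical bucketed table written above), an aggregation on the bucketing column can also avoid a shuffle:
# groupBy on the bucketing column: rows with the same "id" are already in the same bucket,
# so Spark can aggregate without repartitioning the data first
agg = spark.table("bucketed_dataset").groupBy("id").count()
agg.explain()  # no Exchange before the aggregation when the grouping key matches the bucketing column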