Resilient Distributed Datasets (RDDs)
RDDs are the main logical data units in Spark. An RDD is a distributed collection of objects, stored in memory or on disk across the machines of a cluster. A single RDD is divided into logical partitions, so that each partition can be stored and processed on a different machine.
RDDs are immutable (read-only) in nature. You cannot change an existing RDD, but you can derive new RDDs from it by applying coarse-grained operations called transformations.
An RDD can be cached and reused in future computations, which is a significant performance benefit. RDDs are lazily evaluated: computation is deferred until a result is actually needed, which avoids unnecessary work and improves efficiency.
Features of an RDD in Spark
Here are some features of RDD in Spark:
- Resilience: RDDs track lineage information so that lost data can be recomputed automatically on failure. This property is also called fault tolerance.
- Distributed: The data in an RDD resides on multiple nodes; it is partitioned across different nodes of a cluster.
- Lazy evaluation: Data does not get loaded into an RDD when you define it. Transformations are actually computed only when you call an action, such as count or collect, or save the output to a file system.
Operations on RDDs
There are two basic kinds of operations that can be performed on RDDs: transformations and actions.
Transformations
Transformations accept existing RDDs as input and produce one or more new RDDs as output. The data in the existing RDD does not change, because RDDs are immutable.
Actions
Actions in Spark are operations that return the final result of RDD computations. An action triggers execution: Spark walks the lineage graph to compute the required RDDs in order, applies all pending transformations, and returns the result to the Spark driver.