Resilient Distributed Datasets (RDDs)


Source: https://www.linkedin.com/pulse/resilient-distributed-datasets-rdds-arabinda-mohapatra-5phef?trackingId=fJ3OTm4OQfCQnxnrWneABA%3D%3D



RDDs are the main logical data units in Spark. An RDD is a distributed collection of objects stored in memory or on disk across the machines of a cluster. A single RDD is divided into multiple logical partitions, so those partitions can be stored and processed on different machines of the cluster.

RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, like transformations, on an existing RDD.


An RDD in Spark can be cached and reused across future transformations, which is a huge benefit for users. RDDs are also lazily evaluated: they delay computation until a result is actually needed, which saves time and improves efficiency.


Features of an RDD in Spark

Here are some features of RDD in Spark:

  • Resilience: RDDs track data lineage information to recover lost data automatically on failure. This property is also called fault tolerance.
  • Distributed: Data present in an RDD resides on multiple nodes. It is distributed across different nodes of a cluster.
  • Lazy evaluation: Data does not get loaded into an RDD even when you define it. Transformations are actually computed only when you call an action, such as count or collect, or save the output to a file system.


Operations on RDDs

There are two basic operations that can be done on RDDs. They are transformations and actions.


Transformations

Transformations accept existing RDDs as input and output one or more new RDDs. The data in the existing RDD does not change, since RDDs are immutable.

Actions

Actions in Spark are operations that return the end result of RDD computations. Spark uses the lineage graph to load and transform the data in the required order; once all the transformations are applied, the action returns the final result to the Spark driver.
