Resilient Distributed Datasets (RDDs)
RDDs are the main logical data units in Spark. An RDD is a distributed collection of objects, stored in memory or on disk across the machines of a cluster. A single RDD is divided into logical partitions, so that each partition can be stored and processed on a different machine.
RDDs are immutable (read-only) in nature. You cannot change an existing RDD, but you can derive new RDDs from it by applying coarse-grained operations called transformations.
An RDD can be cached and reused in future computations, which is a significant performance benefit. RDDs are lazily evaluated: computation is deferred until a result is actually needed, which avoids unnecessary work and improves efficiency.
Features of an RDD in Spark
Here are some features of RDD in Spark:
- Resilience: RDDs track lineage information so that lost data can be recomputed automatically on failure. This property is also called fault tolerance.
- Distributed: The data in an RDD resides on multiple nodes; it is partitioned across different nodes of a cluster.
- Lazy evaluation: Data does not get loaded into an RDD when you define it. Transformations are actually computed only when you call an action, such as count or collect, or save the output to a file system.
Operations on RDDs
There are two basic kinds of operations that can be performed on RDDs: transformations and actions.
Transformations
Transformations accept existing RDDs as input and produce one or more new RDDs as output. The data in the existing RDD does not change, because RDDs are immutable.
Actions
Actions in Spark are operations that return the final result of RDD computations. An action triggers execution: Spark walks the lineage graph to compute the required RDDs in order, applies all pending transformations, and returns the result to the Spark driver.