Spark - Resilient Distributed Datasets (RDDs)
About
Resilient distributed datasets (RDDs) are one of the data structures in Spark.
- Write programs in terms of operations on distributed datasets
- Partitioned collections of objects spread across a cluster, stored in memory or on disk
- RDDs built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save)
- RDDs automatically rebuilt on machine failure
RDDs cannot be changed once they are created: they are immutable. You create RDDs by applying transformations to existing RDDs, and Spark automatically tracks how you create and manipulate them (their lineage) so that it can reconstruct any data that is lost due to a slow or failed machine. Operations on RDDs are performed in parallel.
- You cannot change it (an RDD is immutable once created).
- You can transform it.
- You can perform actions on it.
Spark tracks lineage information to enable the efficient recomputation of any lost data if a machine should fail or crash.
RDDs can contain any type of data. See Spark RDD - (Creation|Construction|Initialization)
Spark enables operations on collections of elements in parallel.
You parallelize existing Python collections, such as lists, by using RDDs.
- distributed, because the data may be distributed across multiple executors in the cluster.
- resilient, because the lineage of the data is preserved and, therefore, the data can be re-created on a new node at any time. Lineage is the sequence of operations that was applied to the base data set.
Script
An RDD script
The action (collect) causes the transformations (parallelize, filter, and map) to be executed. An RDD pipeline follows a builder pattern where the last method either builds an object or simply performs an action.
When you perform transformations and actions that use functions, Spark automatically ships a closure containing that function to the workers so that it can run on the workers.
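A minimal sketch of such a script in Java, assuming an existing JavaSparkContext named sc (the dataset and predicate are purely illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Build the pipeline: parallelize, filter, and map are lazy
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> big     = numbers.filter(n -> n > 2); // keep values greater than 2
JavaRDD<Integer> doubled = big.map(n -> n * 2);        // double each remaining value
// collect is the action: it triggers the execution of the whole chain
List<Integer> result = doubled.collect();              // [6, 8, 10]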
Creation
In your driver program, you create an RDD from either of the following (both paths are sketched below):
- a file
- or from a collection (parallelize a collection)
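A minimal Java sketch of both paths, assuming an existing JavaSparkContext sc (the file path is purely illustrative):

// From a file: one RDD element per line of the file
JavaRDD<String> lines = sc.textFile("data/input.txt");
// From a collection: parallelize a local list
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3));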
Transformation
You then specify transformations to apply to that RDD. They lazily create new RDDs (without immediately applying the transformation).
Spark remembers the set of transformations that are applied to a base data set. It can then optimize the required calculations and automatically recover from failures and slow workers.
Spark transformations create new data sets from an existing one.
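A small Java sketch of this laziness, assuming an existing JavaSparkContext sc: no job runs until the action at the end.

JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "bb", "ccc"));
// map and filter are only recorded in the lineage; nothing is computed yet
JavaRDD<Integer> lengths  = words.map(String::length);
JavaRDD<Integer> longOnes = lengths.filter(l -> l > 1);
// count is an action: it triggers the execution of map and filter
long n = longOnes.count(); // 2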
Cache
We can cache some RDDs for reuse.
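For example (a Java sketch assuming an existing JavaRDD<String> named words), caching avoids recomputing the lineage for every subsequent action:

JavaRDD<String> cleaned = words.filter(w -> !w.isEmpty());
cleaned.cache();                // materialized in memory on first computation
long total = cleaned.count();   // first action: computes the RDD and fills the cache
long distinct = cleaned.distinct().count(); // second action: reuses the cached data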
Action
We perform actions that are executed in parallel on the RDD.
Examples are collect and count.
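A Java sketch of both actions, assuming an existing JavaRDD<Integer> named numbers:

long howMany = numbers.count();        // returns the number of elements to the driver
List<Integer> all = numbers.collect(); // brings all elements back to the driver
// collect should be reserved for small datasets: the whole RDD lands in driver memory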
Loop / Show
Example on an RDD of rows (taking the first 10 rows and printing each field, comma-separated):
rowJavaRDD.take(10).forEach(x -> {
    // print every field of the row, comma-separated
    for (int i = 0; i < x.size(); i++) {
        System.out.print(x.get(i) + ",");
    }
    System.out.println(); // end the line after the last field
});
Property
Internally, each RDD is characterized by five main properties (some of them are inspectable from the API, as sketched after this list):
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
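Some of these properties can be inspected from the Java API; a sketch assuming an existing JavaSparkContext sc:

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);
System.out.println(rdd.partitions().size());   // the list of partitions (here: 2)
System.out.println(rdd.rdd().dependencies());  // dependencies on parent RDDs (empty for a base RDD)
System.out.println(rdd.rdd().partitioner());   // None here: a plain parallelized RDD has no Partitioner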
Partition
The number of partitions is an RDD property.
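You can request a partition count at creation time and read it back; a Java sketch assuming an existing JavaSparkContext sc:

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3); // ask for 3 partitions
System.out.println(rdd.getNumPartitions()); // 3
JavaRDD<Integer> fewer = rdd.coalesce(2);   // reduce the partition count (no shuffle by default)
System.out.println(fewer.getNumPartitions()); // 2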
Operations
There are two types of operations you can perform on an RDD:
- transformations
- and actions.
You can also persist, or cache, RDDs in memory or on disk.
An RDD pipeline follows a builder pattern where the last method performs an action, whereas the previous methods just set parameters; these are called transformations.
Documentation / Reference
- The JavaRDD class contains the basic operations available on all RDDs, such as map, filter, and persist.