Resilient distributed datasets (RDDs) are one of the data structures in Spark.
RDDs are immutable: they cannot be changed once they are created. You create new RDDs by applying transformations to existing RDDs, and Spark automatically tracks how you create and manipulate them (their lineage) so that it can reconstruct any data that is lost because of a slow or failed machine. Operations on RDDs are performed in parallel.
Spark tracks lineage information so that it can efficiently recompute any lost data if a machine fails or crashes.
RDDs can contain any type of data. See Spark RDD - (Creation|Construction|Initialization)
Spark enables operations on collections of elements in parallel.
You can parallelize existing Python collections, such as lists, to turn them into RDDs.
Spark keeps track of each transformation that was applied to the base data set.
An RDD script
The action (collect) causes the preceding lazy steps (parallelize, filter, and map) to be executed. An RDD pipeline follows a builder pattern, where the last method either builds an object or simply performs an action.
When you perform transformations and actions that use functions, Spark automatically pushes a closure containing that function to the workers so that it can run there.
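To make the idea of shipping a closure more concrete, here is a minimal sketch in plain Java, without Spark. It assumes that "pushing a closure" amounts to serializing the function together with the values it captured, sending the bytes, and rebuilding the function on the other side. The class ClosureShippingSketch and its methods are hypothetical names used for illustration only. They are not part of Spark's API, and Spark's own serialization machinery is more involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

// Hypothetical illustration class, not part of Spark.
final class ClosureShippingSketch {

    /** A function type that can be serialized, as a closure must be before it leaves the driver. */
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    /** Builds a closure that captures the local variable factor. */
    static SerializableFunction<Integer, Integer> multiplyBy(int factor) {
        return x -> x * factor;
    }

    /** "Driver" side: turn the closure (code reference plus captured values) into bytes. */
    static byte[] serialize(Serializable closure) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(closure);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** "Worker" side: rebuild the closure from the received bytes. */
    @SuppressWarnings("unchecked")
    static <T, R> Function<T, R> deserialize(byte[] payload) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (Function<T, R>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not rebuild the closure", e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = serialize(multiplyBy(3));
        Function<Integer, Integer> onWorker = deserialize(payload);
        // The captured factor (3) travelled with the function.
        System.out.println(onWorker.apply(14)); // prints 42
    }
}
```

In this sketch both sides live in the same JVM. The point is only that the captured value is part of what gets shipped, which is why everything a closure references has to be serializable.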
In your driver program, you create an RDD from a file, from a parallelized collection, or from another RDD.
You then specify transformations on that RDD. They lazily create new RDDs, without applying the transformation immediately.
Spark remembers the set of transformations that are applied to a base data set. It can then optimize the required calculations and automatically recover from failures and slow workers.
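The lazy behavior described above can be sketched with a small toy model in plain Java. This is not Spark code: LazyDataset is a hypothetical class that only mimics the idea, under the assumption that a transformation records a step and that an action replays the recorded steps against the base data. It runs on a single machine and does no partitioning, optimization, or failure recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical toy class, not Spark's RDD: it only records steps until an action runs. */
final class LazyDataset<T> {
    private final Supplier<Stream<T>> compute; // how to rebuild this data set from the base data
    private final List<String> lineage;        // names of the steps recorded so far

    private LazyDataset(Supplier<Stream<T>> compute, List<String> lineage) {
        this.compute = compute;
        this.lineage = lineage;
    }

    /** Creates the base data set from an in-memory collection (copied, so it stays immutable). */
    static <T> LazyDataset<T> parallelize(List<T> base) {
        List<T> snapshot = List.copyOf(base);
        return new LazyDataset<T>(snapshot::stream, List.of("parallelize"));
    }

    /** Transformation: returns a new data set and runs nothing yet. */
    LazyDataset<T> filter(Predicate<T> keep) {
        return new LazyDataset<T>(() -> compute.get().filter(keep), extend("filter"));
    }

    /** Transformation: returns a new data set and runs nothing yet. */
    <R> LazyDataset<R> map(Function<T, R> f) {
        return new LazyDataset<R>(() -> compute.get().map(f), extend("map"));
    }

    /** Action: replays every recorded step against the base data and returns the result. */
    List<T> collect() {
        return compute.get().collect(Collectors.toList());
    }

    List<String> lineage() {
        return lineage;
    }

    private List<String> extend(String step) {
        List<String> steps = new ArrayList<>(lineage);
        steps.add(step);
        return List.copyOf(steps);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger(); // counts how often the map function runs
        LazyDataset<Integer> squares = LazyDataset.parallelize(List.of(1, 2, 3, 4, 5))
                .filter(x -> x % 2 == 1)
                .map(x -> { calls.incrementAndGet(); return x * x; });

        System.out.println(calls.get());       // prints 0: no transformation has run yet
        System.out.println(squares.lineage()); // prints [parallelize, filter, map]
        System.out.println(squares.collect()); // prints [1, 9, 25]
        System.out.println(calls.get());       // prints 3: the action triggered the work
    }
}
```

Because every data set in this sketch knows how to rebuild itself from the base data, calling collect() a second time simply replays the same steps. That is the property that lets a lineage-based system recompute lost results instead of storing copies of them.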
Spark transformations create new data sets from an existing one.
We can cache some RDDs for reuse.
We perform actions that are executed in parallel on the RDD.
Examples are collect and count.
Example on a row
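// Assumes rowJavaRDD is an existing JavaRDD<Row> (Row is org.apache.spark.sql.Row).
// take(10) is an action: it returns the first 10 rows to the driver as a List<Row>.
rowJavaRDD.take(10).forEach(row -> {
    // Print every column of the row, each followed by a comma.
    for (int i = 0; i < row.size(); i++) {
        System.out.print(row.get(i) + ",");
    }
    System.out.println();
});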
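Internally, each RDD is characterized by five main properties: a list of partitions, a function for computing each split, a list of dependencies on other RDDs, optionally a Partitioner for key-value RDDs, and optionally a list of preferred locations on which to compute each split.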
The number of partitions is an RDD property.
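There are two types of operations you can perform on an RDD: transformations, which lazily define a new RDD, and actions, which trigger the computation and return a result.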
You can also persist, or cache, RDDs in memory or on disk.
An RDD pipeline follows a builder pattern: the last method performs an action, whereas the preceding methods, called transformations, only set parameters.