Spark DataSet - (Object) Encoder

Card Puncher Data Processing


To define a dataset Object, an encoder is required.

It is used to tell Spark to generate code at runtime to serialize the object into a binary structure.

This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format).

The encoder maps the object type T

to Spark's internal type system.


Given a class Person with two fields, name (string) and age (int)

Dataset<Person> people ="...").as(Encoders.bean(Person.class)); 



  • Scala Encoders are generally created automatically through implicits from a SparkSession, or can be explicitly created by calling static methods on Encoders.
import spark.implicits._
val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
  • Java: Encoders are specified by calling static methods on Encoders.
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());



Encoders can be composed into tuples (logical row):

Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
List<Tuple2<Integer, String>> data2 = Arrays.asList(new scala.Tuple2(1, "a");
Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);

Java Beans

constructed from Java Beans:


Documentation / Reference

Discover More
Card Puncher Data Processing
Spark DataSet - Map

With java, you need to give the object encoder Scala do the cast implicitly.
Card Puncher Data Processing
Spark DataSet - Object

A dataset is a data set of specific object. To define this domain Specific Object, an encoder is required. Scala Java where To understand the internal binary representation for data,...

Share this page:
Follow us:
Task Runner