Pig is:
Pig provides:
Not a pure relational data model: “schema-on-read” rather than “schema-on-write”.
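To make the schema-on-read idea concrete, here is a minimal Python sketch (not Pig itself). The sample data, the `read_with_schema` helper, and the choice to turn a failed cast into `None` are all assumptions made for illustration. The point is only that the stored bytes carry no schema, and that field names and types are supplied by whoever reads them.

```python
import csv
import io

# Hypothetical raw file contents: tab-separated text with no embedded schema.
RAW_USERS = "alice\t23\nbob\t17\ncarol\tn/a\n"


def read_with_schema(raw, schema):
    """Apply (field name, cast function) pairs to each raw row at read time.

    A field that fails to cast becomes None instead of aborting the read
    (an assumption of this sketch).
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw), delimiter="\t"):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


# The same stored bytes can be read under two different schemas.
typed = read_with_schema(RAW_USERS, [("name", str), ("age", int)])
untyped = read_with_schema(RAW_USERS, [("name", str), ("age", str)])
```

With schema-on-write, the malformed `n/a` value would have been rejected when the data was stored; here it only surfaces, as `None`, when a reader asks for an integer.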
Improvements to Pig have made it, in many cases, just as efficient as writing the code directly in MapReduce.
You can work with data in its native format, in situ.
Pipelines operate on collections of tuples.
In Pig Latin:
Users = load 'users' as (name, age);                -- field names are assigned at load time
Fltrd = filter Users by age >= 18 and age <= 25;    -- keep users aged 18 to 25
Pages = load 'Activity Data' as (user, url);
Jnd = join Fltrd by name, Pages by user;            -- pair each kept user with their page views
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;  -- clicks per URL
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
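For readers who want to check what the script computes, the following Python sketch mirrors each step on a tiny in-memory data set. The sample tuples and the `top_sites` function are made up for illustration, and the code runs on one machine; it shows the logic of the pipeline, not how Pig executes it.

```python
from collections import Counter

# Made-up sample data standing in for the 'users' and 'Activity Data' inputs.
users = [("alice", 23), ("bob", 17), ("carol", 19), ("dave", 40)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("dave", "c.com"), ("carol", "c.com"),
    ("alice", "a.com"),
]


def top_sites(users, pages, n=5):
    # Fltrd = filter Users by age >= 18 and age <= 25
    fltrd = [(name, age) for name, age in users if 18 <= age <= 25]
    # Jnd = join Fltrd by name, Pages by user (naive nested-loop join)
    jnd = [(name, url) for name, _ in fltrd for user, url in pages if name == user]
    # Grpd / Smmd = group Jnd by url, then COUNT per group
    clicks = Counter(url for _, url in jnd)
    # Srtd / Top5 = order by clicks desc, then limit n
    return clicks.most_common(n)


result = top_sites(users, pages)
```

Each intermediate value is a collection of tuples, which is the same shape the Pig aliases (`Fltrd`, `Jnd`, and so on) have.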
Because of lazy evaluation, no work is done until STORE (or DUMP) is called.
Pig reduces the plan to the minimum number of MapReduce jobs, because each job is expensive.
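The deferred-execution idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Relation` class and its methods are hypothetical names, not part of Pig, and the sketch does not model how Pig compiles or merges MapReduce jobs. It only shows that each statement extends a plan and that nothing runs until `store` is called.

```python
class Relation:
    """A deferred relation: it records how to compute rows, not the rows."""

    executed = 0  # how many times any plan has actually been run

    def __init__(self, compute, plan):
        self._compute = compute  # zero-argument function that produces the rows
        self.plan = plan         # operator names, kept for inspection

    @classmethod
    def load(cls, rows):
        return cls(lambda: list(rows), ["load"])

    def filter(self, predicate):
        return Relation(
            lambda: [r for r in self._compute() if predicate(r)],
            self.plan + ["filter"],
        )

    def limit(self, n):
        return Relation(lambda: self._compute()[:n], self.plan + ["limit"])

    def store(self):
        """The only method that triggers execution of the recorded plan."""
        Relation.executed += 1
        return self._compute()


users = Relation.load([("alice", 23), ("bob", 17), ("carol", 19)])
adults = users.filter(lambda r: r[1] >= 18).limit(1)
runs_before_store = Relation.executed  # still 0: only a plan exists so far
out = adults.store()                   # the plan runs here
```

Because the whole plan is known before anything executes, a real system has the chance to rearrange or combine steps first, which is what makes it possible to cut the number of jobs.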