About

A collector collects events sent by a tracker.

The events and the data sent are described in a measurement protocol.

Aggregation

Data aggregation refers to techniques for gathering individual data records (for example, log records) and combining them into larger data files.

Why aggregation? Before processing your data, Hadoop splits your files into multiple chunks and assigns a single map task to each part. If you are using HDFS as the underlying data storage, the HDFS framework has already split the data files into multiple blocks, and Hadoop assigns a single map task to each HDFS block. Aggregating many small records into larger files therefore reduces the number of map tasks and the scheduling overhead that comes with them.
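As an illustration, a pre-processing step can bundle many small log files into larger ones before handing them to Hadoop. The following Python sketch assumes illustrative paths (logs/*.log, bundles/) and a ~128 MB target bundle size (a typical HDFS block size):

import glob
import os

BUNDLE_SIZE = 128 * 1024 * 1024  # target bundle size in bytes (~one HDFS block)

def aggregate(input_glob, output_dir):
    # Concatenate small files into bundle-0.log, bundle-1.log, ... so that
    # each bundle stays close to one HDFS block (one map task per bundle).
    os.makedirs(output_dir, exist_ok=True)
    index, written = 0, 0
    out = open(os.path.join(output_dir, f"bundle-{index}.log"), "wb")
    for path in sorted(glob.glob(input_glob)):
        with open(path, "rb") as src:
            data = src.read()
        if written and written + len(data) > BUNDLE_SIZE:
            out.close()
            index, written = index + 1, 0
            out = open(os.path.join(output_dir, f"bundle-{index}.log"), "wb")
        out.write(data)
        written += len(data)
    out.close()

aggregate("logs/*.log", "bundles")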

Protocol

A collector would implement these two HTTP methods:

POST

POST /collect HTTP/1.1
Host: www.example.com
User-Agent: user_agent_string

payload_data
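For instance, a tracker could send a hit over POST with the Python standard library. The endpoint URL, the User-Agent string, and the payload fields below are illustrative assumptions, not part of any specific measurement protocol:

import urllib.parse
import urllib.request

# Hypothetical payload; a real measurement protocol defines its own fields.
payload = urllib.parse.urlencode({"t": "event", "ea": "click"}).encode()
req = urllib.request.Request(
    "https://www.example.com/collect",
    data=payload,  # a request body makes urllib issue a POST
    headers={"User-Agent": "user_agent_string"},
)
with urllib.request.urlopen(req) as response:
    print(response.status)  # a 2xx status means the collector accepted the hit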

GET

GET /collect?payload_data HTTP/1.1
Host: www.example.com
User-Agent: user_agent_string

A GET request can be triggered, for example, by embedding the collection URL in an image tag (a tracking pixel), so that loading the image sends the hit.

To avoid hitting a cached HTTP GET response, a collector should support a cache buster (i.e., a special parameter that can be set to a random number in order to ensure that all requests are unique and that subsequent requests are not served from the cache).

Example from Google Analytics with the z parameter: https://www.example.com/collect?payload_data&z=123456
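A minimal sketch of such a cache-busted GET request, reusing the z parameter name from the Google Analytics example above (the other payload fields are again illustrative):

import random
import urllib.parse
import urllib.request

params = {"t": "pageview", "z": random.randint(0, 2**31 - 1)}  # z busts the cache
url = "https://www.example.com/collect?" + urllib.parse.urlencode(params)
req = urllib.request.Request(url, headers={"User-Agent": "user_agent_string"})
with urllib.request.urlopen(req) as response:
    print(response.status)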

Error handling

If you do not get a 2xx status code, you should NOT retry the request. Instead, you should stop and correct any errors in your HTTP request.
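A sketch of this policy in Python, noting that urllib raises HTTPError for non-2xx responses (the function name send_hit is an illustrative assumption):

import urllib.error
import urllib.request

def send_hit(url):
    # Send the hit once; on a non-2xx response, do NOT retry.
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Stop and fix the request (URL, parameters, payload) instead of retrying.
        raise RuntimeError(f"collector rejected the hit: HTTP {err.code}") from err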

Documentation / Reference

  • AWS_Amazon_EMR_Best_Practices.pdf