Consumer Analytics - Event Collector


A collector collects events sent by a tracker.

The events and the data sent are described in a measurement protocol.


Data aggregation refers to techniques for gathering individual data records (for example, log records) and combining them into larger bundles of data files.

Why aggregate? Before processing your data, Hadoop splits your files into multiple chunks, and a single map task processes each part. If you are using HDFS as the underlying data storage, the HDFS framework has already separated the data files into blocks, and since the data is fragmented, Hadoop assigns a single map task to each HDFS block. Many small event files therefore produce many small map tasks; aggregating them into larger files reduces that overhead.
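The aggregation step above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function name, the *.log glob pattern, and the 128 MB default (a common HDFS block size) are all assumptions to be tuned to your cluster.

```python
import glob
import os

def aggregate_logs(input_dir, output_path, target_bytes=128 * 1024 * 1024):
    """Combine many small event-log files into one larger bundle.

    target_bytes defaults to a typical 128 MB HDFS block size
    (an assumption; align it with your cluster's configuration).
    """
    written = 0
    with open(output_path, "wb") as out:
        for path in sorted(glob.glob(os.path.join(input_dir, "*.log"))):
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            written += len(data)
            if written >= target_bytes:
                break  # a real pipeline would start a new bundle here
    return written
```

A real job would also delete or archive the source files and roll over to a new bundle when the size target is reached.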


A collector would implement these two HTTP methods:

POST /collect HTTP/1.1
User-Agent: user_agent_string

payload_data

GET /collect?payload_data HTTP/1.1
User-Agent: user_agent_string
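A minimal in-memory collector handling both methods could be sketched as follows. The /collect path matches the request lines above; the handler class and the events list are illustrative assumptions (a real collector would write to durable storage, not a list).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Illustrative in-memory sink; a real collector would persist events.
events = []

class CollectorHandler(BaseHTTPRequestHandler):
    def _store(self, payload):
        events.append(payload)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Payload arrives in the query string: /collect?payload_data
        url = urlparse(self.path)
        if url.path == "/collect":
            self._store(parse_qs(url.query))
        else:
            self.send_error(404)

    def do_POST(self):
        # Payload arrives in the request body.
        if urlparse(self.path).path == "/collect":
            length = int(self.headers.get("Content-Length", 0))
            self._store(parse_qs(self.rfile.read(length).decode()))
        else:
            self.send_error(404)

# To run: HTTPServer(("", 8080), CollectorHandler).serve_forever()
```

Responding with an empty 200 body keeps the tracker side simple; some collectors instead return a 1x1 transparent GIF so the endpoint can be used as an image src.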

A GET request can be triggered, for example, by embedding the collect URL as the src of an image tag (a tracking pixel) or of a script tag.

To avoid hitting a cached response to an HTTP GET request, a collector should support a cache buster (i.e. a special parameter that can be set to a random number so that every request is unique and subsequent requests are not served from the cache).

Google Analytics, for example, uses the z parameter for this purpose.
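The cache buster is added client-side when building the hit URL. A minimal sketch: the z name follows Google Analytics' convention, while the function name and base URL are illustrative.

```python
import random
from urllib.parse import urlencode

def collect_url(base, params):
    """Build a collect URL with a random cache buster.

    The parameter name `z` follows Google Analytics; the random value
    makes each request unique so caches cannot serve a stale hit.
    """
    qs = dict(params)
    qs["z"] = random.randint(0, 2**31 - 1)
    return base + "?" + urlencode(qs)
```

Each call produces a different URL for the same payload, so intermediate HTTP caches always forward the request to the collector.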

Error handling

If you do not get a 2xx status code, you should NOT retry the request. Instead, you should stop and correct any errors in your HTTP request.
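That no-retry rule can be expressed in the sending code. A hedged sketch, assuming a plain HTTP GET hit; the function name is illustrative.

```python
import urllib.error
import urllib.request

def send_event(url):
    """Send a hit and report whether the collector accepted it.

    Per the error-handling rule above, a non-2xx response is NOT
    retried; the request itself should be corrected instead.
    """
    try:
        with urllib.request.urlopen(url) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError as e:
        # Do not retry: surface the status so the payload can be fixed.
        print("rejected with HTTP %d; correct the request" % e.code)
        return False
```

Note that some collectors (Google Analytics among them) return 2xx even for malformed payloads, so a 2xx only means the hit was received, not that it was valid.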


Documentation / Reference

  • AWS_Amazon_EMR_Best_Practices.pdf
