Consumer Analytics - Event Collector


A collector collects events sent by a tracker.

The events and the data sent are described in a measurement protocol.


Data aggregation refers to techniques for gathering individual data records (for example, log records) and combining them into larger bundles of data files.

Why aggregate? Before processing your data, Hadoop splits your files into multiple chunks, and a single map task processes each part. If you are using HDFS as the underlying data storage, the HDFS framework has already separated the data files into blocks, and since the data is fragmented, Hadoop assigns a single map task to each HDFS block. Many small event files therefore produce many small map tasks; aggregating them into larger files reduces that overhead.
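The aggregation step above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function name, the *.log glob pattern, and the 128 MB default (a common HDFS block size) are all assumptions to be tuned to your cluster.

```python
import glob
import os

def aggregate_logs(input_dir, output_path, target_bytes=128 * 1024 * 1024):
    """Combine many small event-log files into one larger bundle.

    target_bytes defaults to a typical 128 MB HDFS block size
    (an assumption; align it with your cluster's configuration).
    """
    written = 0
    with open(output_path, "wb") as out:
        for path in sorted(glob.glob(os.path.join(input_dir, "*.log"))):
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            written += len(data)
            if written >= target_bytes:
                break  # a real pipeline would start a new bundle here
    return written
```

A real job would also delete or archive the source files and roll over to a new bundle when the size target is reached.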


A collector would implement these two HTTP methods:

POST /collect HTTP/1.1
User-Agent: user_agent_string

payload_data

GET /collect?payload_data HTTP/1.1
User-Agent: user_agent_string
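A minimal in-memory collector handling both methods could be sketched as follows. The /collect path matches the request lines above; the handler class and the events list are illustrative assumptions (a real collector would write to durable storage, not a list).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Illustrative in-memory sink; a real collector would persist events.
events = []

class CollectorHandler(BaseHTTPRequestHandler):
    def _store(self, payload):
        events.append(payload)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Payload arrives in the query string: /collect?payload_data
        url = urlparse(self.path)
        if url.path == "/collect":
            self._store(parse_qs(url.query))
        else:
            self.send_error(404)

    def do_POST(self):
        # Payload arrives in the request body.
        if urlparse(self.path).path == "/collect":
            length = int(self.headers.get("Content-Length", 0))
            self._store(parse_qs(self.rfile.read(length).decode()))
        else:
            self.send_error(404)

# To run: HTTPServer(("", 8080), CollectorHandler).serve_forever()
```

Responding with an empty 200 body keeps the tracker side simple; some collectors instead return a 1x1 transparent GIF so the endpoint can be used as an image src.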

A GET request can be triggered, for example, by embedding the collect URL as the src of an image tag (a tracking pixel) or of a script tag.

To avoid hitting a cached response to an HTTP GET request, a collector should support a cache buster (i.e. a special parameter that can be set to a random number so that every request is unique and subsequent requests are not served from the cache).

Google Analytics, for example, uses the z parameter for this purpose.
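The cache buster is added client-side when building the hit URL. A minimal sketch: the z name follows Google Analytics' convention, while the function name and base URL are illustrative.

```python
import random
from urllib.parse import urlencode

def collect_url(base, params):
    """Build a collect URL with a random cache buster.

    The parameter name `z` follows Google Analytics; the random value
    makes each request unique so caches cannot serve a stale hit.
    """
    qs = dict(params)
    qs["z"] = random.randint(0, 2**31 - 1)
    return base + "?" + urlencode(qs)
```

Each call produces a different URL for the same payload, so intermediate HTTP caches always forward the request to the collector.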

Error handling

If you do not get a 2xx status code, you should NOT retry the request. Instead, you should stop and correct any errors in your HTTP request.
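That no-retry rule can be expressed in the sending code. A hedged sketch, assuming a plain HTTP GET hit; the function name is illustrative.

```python
import urllib.error
import urllib.request

def send_event(url):
    """Send a hit and report whether the collector accepted it.

    Per the error-handling rule above, a non-2xx response is NOT
    retried; the request itself should be corrected instead.
    """
    try:
        with urllib.request.urlopen(url) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError as e:
        # Do not retry: surface the status so the payload can be fixed.
        print("rejected with HTTP %d; correct the request" % e.code)
        return False
```

Note that some collectors (Google Analytics among them) return 2xx even for malformed payloads, so a 2xx only means the hit was received, not that it was valid.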


Documentation / Reference

  • AWS_Amazon_EMR_Best_Practices.pdf
