Web Robot - Crawler

What is a web crawler?

A web crawler is an application that reads web resources (mostly web pages) and parses them to extract meaningful information.
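The parsing half of that definition can be sketched with the Python standard library alone. The class below is illustrative (its name and the sample HTML are not from this page): it extracts a page title and outgoing links, which is the kind of "meaningful information" a crawler typically pulls from a fetched page.

```python
from html.parser import HTMLParser

class TitleLinkExtractor(HTMLParser):
    """Collects the page <title> text and all outgoing link hrefs."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A page the crawler might have fetched (hypothetical sample).
html = ('<html><head><title>Example</title></head>'
        '<body><a href="https://example.com/a">A</a></body></html>')
parser = TitleLinkExtractor()
parser.feed(html)
print(parser.title)   # Example
print(parser.links)   # ['https://example.com/a']
```

A real crawler would feed the parser with bytes fetched over HTTP; the sketch skips the network step to stay self-contained.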

Steps

A crawl cycle consists of 4 steps:

  • Select: selects the URLs to fetch.
    • All URLs are partitioned by domain, host, or IP. This means that all URLs from the same domain (host, IP) end up in the same partition and are handled by the same (reduce) task. Within each partition, the URLs are sorted by score (best first).
    • A maximum of topN URLs is selected.
  • Fetch: downloads the selected pages.
  • Parse: parses all fetched pages and scrapes their content.
  • Persist: persists the parse output in a database.
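The selection step above (partition by host, sort by score, keep topN) can be sketched as follows. This is a minimal illustration, not the actual implementation; the function name, the per-partition topN cutoff, and the sample scores are assumptions.

```python
from urllib.parse import urlparse
from collections import defaultdict

def select_urls(scored_urls, top_n):
    """Partition candidate (url, score) pairs by host, sort each
    partition by score (best first), and keep at most top_n URLs
    per partition."""
    partitions = defaultdict(list)
    for url, score in scored_urls:
        partitions[urlparse(url).hostname].append((url, score))
    selected = {}
    for host, urls in partitions.items():
        urls.sort(key=lambda pair: pair[1], reverse=True)
        selected[host] = [url for url, _ in urls[:top_n]]
    return selected

# Hypothetical candidate URLs with scores.
candidates = [
    ("https://a.example/page1", 0.9),
    ("https://a.example/page2", 0.4),
    ("https://a.example/page3", 0.7),
    ("https://b.example/home", 0.8),
]
selected = select_urls(candidates, top_n=2)
print(selected["a.example"])  # the two best-scored a.example URLs
```

Keeping each host's URLs in one partition is what lets a single task fetch them sequentially, which in turn makes per-host rate limiting enforceable.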

The crawler needs to respect the rate-limiting configuration.
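One common way to honor such a configuration is a minimum delay between requests to the same host. The class below is a sketch under that assumption (the name and the delay value are illustrative, not from this page):

```python
import time
from collections import defaultdict

class HostRateLimiter:
    """Enforces a minimum delay between requests to the same host."""
    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.last_request = defaultdict(float)  # host -> last timestamp

    def wait(self, host):
        """Block until a request to `host` is allowed, then record it."""
        elapsed = time.monotonic() - self.last_request[host]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()

limiter = HostRateLimiter(min_delay_seconds=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait("example.com")  # 2nd and 3rd calls sleep ~0.1 s each
print(time.monotonic() - start >= 0.2)  # True
```

The fetch task would call `wait(host)` before each request; because all URLs of a host live in one partition, one limiter instance per task suffices.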

Implementation


Documentation / Reference

