What is a web crawler?

A web crawler is an application that reads web resources (mostly web pages) and parses them to extract meaningful information.

Steps

A crawl cycle consists of 4 steps:

Select: Selects the URLs to fetch.
- All URLs are partitioned by domain, host or IP. This means that all URLs from the same domain (host, IP) end up in the same partition and are handled by the same (reduce) task. Within each partition, all URLs are sorted by score (best first).
- A maximum of topN URLs gets selected.
Fetch: Fetches all selected URLs.
Parse: Parses all fetched web pages (scraping).
Persist: Persists the parse output in a database.
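
The four steps above can be sketched in a few lines of Python. This is a minimal, single-process illustration and not how Nutch implements them. The names select_urls, LinkParser and crawl_cycle are hypothetical, the fetch step is passed in as a function so the sketch makes no network calls, the page table is an invented schema, and topN is assumed to cap the whole selection rather than each partition.

```python
import sqlite3
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def select_urls(scored_urls, top_n):
    """Select: keep the top_n best-scored URLs, partitioned by host."""
    # Assumption: top_n caps the whole selection, not each partition.
    best = sorted(scored_urls.items(), key=lambda item: item[1], reverse=True)[:top_n]
    partitions = defaultdict(list)
    for url, _score in best:  # already best first, so each partition stays sorted
        partitions[urlsplit(url).hostname].append(url)
    return dict(partitions)


class LinkParser(HTMLParser):
    """Parse: collect the page title and the absolute URLs of its links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl_cycle(scored_urls, top_n, fetch, db):
    """Run one select / fetch / parse / persist cycle and return the new links."""
    # Hypothetical schema: one row per crawled page.
    db.execute("CREATE TABLE IF NOT EXISTS page (url TEXT PRIMARY KEY, title TEXT)")
    discovered = []
    for urls in select_urls(scored_urls, top_n).values():
        for url in urls:
            html = fetch(url)                # Fetch: caller-supplied function
            parser = LinkParser(url)         # Parse: extract title and links
            parser.feed(html)
            db.execute(                      # Persist: store the parse output
                "INSERT OR REPLACE INTO page VALUES (?, ?)",
                (url, parser.title.strip()),
            )
            discovered.extend(parser.links)
    db.commit()
    return discovered
```

Calling crawl_cycle with a dictionary of URL scores, a topN value, a fetch function and a sqlite3 connection runs one cycle. The returned links would typically be scored and fed into the next cycle.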

A crawler needs to respect the rate-limiting configuration so that it does not overload the hosts it fetches from.
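
One common form of rate limiting is a minimum delay between two requests to the same host. The sketch below assumes that form; the class name HostRateLimiter and its parameters are hypothetical, and a real crawler may also need to honor per-site settings.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum delay between two requests to the same host."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the host of `url` may be requested again; return the wait."""
        host = urlsplit(url).hostname
        now = self.clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self.sleep(delay)
        # The next request to this host is allowed min_delay after this one.
        self._next_allowed[host] = now + delay + self.min_delay
        return delay
```

Calling wait(url) just before each fetch keeps requests to one host spaced out, while requests to other hosts are not delayed.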

Implementation

Crawlers can be built with a headless browser library.

Documentation / Reference

http://wiki.apache.org/nutch/Nutch2Crawling