Web Robot - Crawler
What is a web crawler?
A web crawler is an application that reads web resources (mostly web pages) and parses them to extract meaningful information.
Steps
A crawl cycle consists of four steps (a code sketch follows the list):
- Select: selects the URLs to fetch.
  - All URLs are partitioned by domain, host, or IP. This means that all URLs from the same domain (host, IP) end up in the same partition and are handled by the same (reduce) task.
  - Within each partition, the URLs are sorted by score (best first), and a maximum of topN URLs is selected.
- Fetch: fetches all selected URLs.
- Parse: parses (scrapes) all fetched web pages.
- Persist: persists the parse output in a database.
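The sketch below strings the four steps together in Python. The scoring of candidate URLs, the pages table, and the choice to apply topN per partition rather than globally are all simplifying assumptions, not a reference implementation.

```python
import sqlite3
import urllib.request
from collections import defaultdict
from urllib.parse import urlparse

def select_urls(candidates, top_n):
    """Select step: partition candidate (url, score) pairs by host and
    keep the best-scored URLs of each partition."""
    partitions = defaultdict(list)
    for url, score in candidates:
        # All URLs of the same host land in the same partition, so a
        # single (reduce) task would handle them in a distributed run.
        partitions[urlparse(url).netloc].append((url, score))
    selected = []
    for scored_urls in partitions.values():
        # Sort by score, best first; keep at most top_n per partition
        # (an assumption: some crawlers apply topN globally instead).
        scored_urls.sort(key=lambda pair: pair[1], reverse=True)
        selected.extend(url for url, _ in scored_urls[:top_n])
    return selected

def fetch(url):
    """Fetch step: download the raw page content."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def parse(html):
    """Parse/scrape step: extract meaningful information. Here only the
    <title> is pulled out, as a stand-in for real extraction logic."""
    start, end = html.find("<title>"), html.find("</title>")
    return html[start + len("<title>"):end] if -1 < start < end else ""

def persist(db, url, title):
    """Persist step: store the parse output in a database."""
    db.execute("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
               (url, title))
    db.commit()

def crawl_cycle(candidates, top_n, db):
    for url in select_urls(candidates, top_n):
        persist(db, url, parse(fetch(url)))

db = sqlite3.connect("crawl.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
crawl_cycle([("https://example.com", 1.0)], top_n=10, db=db)
```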
The crawler needs to respect the rate-limiting configuration.
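What that configuration looks like varies by crawler (it is often derived from a robots.txt Crawl-delay directive or a per-host setting). As a minimal sketch, a per-host minimum delay can be enforced like this; the one-second default is an arbitrary placeholder:

```python
import time
from urllib.parse import urlparse

class HostRateLimiter:
    """Enforces a minimum delay between two requests to the same host."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}  # host -> time of the last fetch

    def wait(self, url):
        """Blocks until the host of `url` may be fetched again."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request[host] = time.monotonic()

limiter = HostRateLimiter(min_delay_seconds=2.0)
limiter.wait("https://example.com/a")  # returns immediately
limiter.wait("https://example.com/b")  # sleeps ~2 seconds
```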
Implementation
Crawlers are built with a headless browser library.
Examples:
- …
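For instance, with Playwright (used here purely as an example of such a library), fetching and scraping a JavaScript-rendered page might look like the following sketch; the target URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium instance.
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Fetch: load and render the page, including JavaScript content.
    page.goto("https://example.com")
    # Parse/scrape: extract information from the rendered page.
    title = page.title()
    html = page.content()
    browser.close()

print(title, len(html))
```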