Web - Robots (Wanderers | Crawlers | Spiders)

1 - About

Bot in a web context.

Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that scan the web automatically.

  • Some do so in order to build a search engine. See Search Engine - Bot
  • The majority of them, however, have malicious intent (trying to steal from, penetrate, or misuse your web application). See Security - Bad Bot (Spambot, ...)

More than half of all web traffic is made up of bots.

3 - Example of report

Example of a bot user-agent report for this website.

where user-agent is the user agent string sent by the robot.

3.1 - Implementation

Bots are generally implemented with a headless browser library.

4 - Management

4.1 - List

There are a lot of bots out there.

See:

4.2 - Configuration

4.2.1 - Robots.txt

4.2.2 - Meta

The meta element with name="ROBOTS" tells visiting robots whether a document may be indexed or used to harvest more links.

In the following meta example a robot should neither index this document, nor analyze it for links.


<META name="ROBOTS" content="NOINDEX, NOFOLLOW">

The terms in the content include NOINDEX and NOFOLLOW, as in the example above, and their counterparts INDEX and FOLLOW.

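A directive can also be specific to a bot. For instance, in the following example no robot may follow the links, and googlebot additionally may not index the document.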


<meta name="robots" content="nofollow">
<meta name="googlebot" content="noindex">
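
As an illustration of how a robot might interpret the content attribute shown above, here is a minimal sketch. The function name parseRobotsContent is hypothetical, and only the NOINDEX and NOFOLLOW terms from the examples are handled. Real crawlers recognize more terms.

```javascript
// Hypothetical helper (not part of any library): turns the value of a
// robots meta "content" attribute into two boolean flags.
// Assumption: only NOINDEX and NOFOLLOW are handled in this sketch.
function parseRobotsContent(content) {
  const terms = content
    .split(',')
    .map((term) => term.trim().toLowerCase())
    .filter((term) => term.length > 0);

  // Unless a term forbids it, indexing and link following are allowed.
  const directives = { index: true, follow: true };
  for (const term of terms) {
    if (term === 'noindex') directives.index = false;
    if (term === 'nofollow') directives.follow = false;
  }
  return directives;
}

// The terms are compared case-insensitively.
console.log(parseRobotsContent('NOINDEX, NOFOLLOW')); // both flags are false
console.log(parseRobotsContent('nofollow'));          // index stays true
```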

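The If-Modified-Since HTTP request header lets a crawler ask whether the content has changed since its last crawl. If it has not, the server can answer 304 Not Modified without sending the body again. Supporting this feature saves bandwidth and overhead.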
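
The server-side decision can be sketched as follows. This is a minimal sketch under assumptions: the function name isNotModified is hypothetical, and the caller is assumed to know the last modification date of the resource. When the function returns true, the server would answer 304 Not Modified instead of sending the full document.

```javascript
// Hypothetical helper: decides whether a 304 Not Modified can be returned.
// ifModifiedSince: raw header value sent by the crawler (may be undefined).
// lastModified: Date of the last change of the resource (assumed known).
function isNotModified(ifModifiedSince, lastModified) {
  if (!ifModifiedSince) return false; // no header: send the full response
  const since = Date.parse(ifModifiedSince);
  if (Number.isNaN(since)) return false; // unparseable header: send the full response
  // HTTP dates have a one-second resolution, so the milliseconds are dropped.
  const modified = Math.floor(lastModified.getTime() / 1000) * 1000;
  return modified <= since;
}

const lastChange = new Date(Date.UTC(2015, 9, 21, 7, 28, 0));
// The crawler already has the latest version: a 304 is enough.
console.log(isNotModified('Wed, 21 Oct 2015 07:28:00 GMT', lastChange)); // true
// The crawler has an older copy: the full document must be sent.
console.log(isNotModified('Tue, 20 Oct 2015 07:28:00 GMT', lastChange)); // false
```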

4.3 - Inspect

4.3.1 - Rendering

See how Googlebot sees your website at GoogleBot rendering

4.4 - Rendering

For bots that cannot render a dynamic JavaScript web page (such as a PWA), you can pre-render it with puppeteer. See pupperender

4.5 - Test

A simple test that detects bots from their user agent string with a regular expression (see Javascript - Regular expression (Regexp)):


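// Matches common bot markers in a user agent string (case-insensitive)
const bots = /bot|crawler|spider|crawling/i;
// navigator.userAgent is only available in a browser
const isBot = bots.test(navigator.userAgent);
if (!isBot) {
  console.log('This agent is not a bot (' + navigator.userAgent + ')');
}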
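
The same test can be wrapped in a function that takes the user agent string as a parameter, so that it also works outside a browser (for instance on a server that reads the User-Agent request header). This is a minimal sketch: the name isBotUserAgent is hypothetical, and because a user agent string can be spoofed, the result is only a heuristic.

```javascript
// Same pattern as in the test above (case-insensitive, no global flag,
// so repeated calls to test() do not keep state).
const BOT_PATTERN = /bot|crawler|spider|crawling/i;

// Hypothetical helper: returns true when the user agent string
// contains one of the common bot markers.
function isBotUserAgent(userAgent) {
  if (typeof userAgent !== 'string') return false;
  return BOT_PATTERN.test(userAgent);
}

const googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const chrome = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

console.log(isBotUserAgent(googlebot)); // true
console.log(isBotUserAgent(chrome));    // false
```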

4.6 - Mobile first

