
Web - Robots.txt

About

robots.txt is a file that controls which parts of your website web bots (crawlers) are permitted to crawl.
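A crawler written in Python can check these permissions with the standard library's urllib.robotparser before fetching a page. A minimal sketch, assuming a hypothetical example.com site:

from urllib.robotparser import RobotFileParser

# Hypothetical site; any host that serves a robots.txt works the same way
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# True if the rules allow the generic user agent "*" to fetch this page
print(rp.can_fetch("*", "https://www.example.com/some/page.html"))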

Example

Disallow all

User-agent: *    # applies to all robots
Disallow: /      # disallow indexing of all pages

Disallow a subdirectory

# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Group 2
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
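As a sketch of how these two groups are resolved, Python's urllib.robotparser can parse the rules inline: Googlebot is matched by Group 1, while agents that match no named group fall back to the * group (the Otherbot name below is illustrative):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /nogooglebot/",
    "",
    "User-agent: *",
    "Allow: /",
])

# Googlebot falls under Group 1 and is blocked from the subdirectory
print(rp.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page"))  # False
# Any other bot falls under Group 2 and is allowed everywhere
print(rp.can_fetch("Otherbot", "http://www.example.com/nogooglebot/page"))   # True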

Delay between page views

The crawl delay is the number of seconds the bot should wait between pageviews.

It does not represent a crawl rate.

Instead, it defines the size of a time window (from 1 to 30 seconds) during which the bot will crawl your website only once.

For example, with a crawl delay of 5 seconds, the bot will crawl at most 17,280 pages per day:

<MATH>24 \text{ hours} \times 60 \text{ minutes} \times 60 \text{ seconds} / 5 \text{ seconds} = 17,280 \text{ pages}</MATH>

Example:

User-agent: *
Disallow: 
Allow: /*
Crawl-delay: 5
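Python's urllib.robotparser (3.6+) exposes this directive through crawl_delay(). A minimal sketch that reads the delay back and derives the daily page budget from the formula above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Allow: /*",
    "Crawl-delay: 5",
])

delay = rp.crawl_delay("*")   # 5; None if the directive is absent
print(delay)
print(24 * 60 * 60 // delay)  # 17280 pages per day at most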

You can also disallow the crawling of URLs that contain a given query parameter.

Example: disallow any URL with the do=search query parameter for all crawlers.

User-agent: *
Disallow: /*?*do=search*
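Note that wildcard support varies by parser (Python's urllib.robotparser, for instance, only does plain prefix matching). A sketch of how such a pattern can be evaluated by translating * into a regular expression; the URLs are illustrative:

import re

# '*' becomes '.*'; everything else is escaped so '?' stays a literal character
pattern = "/*?*do=search*"
regex = re.compile("^" + ".*".join(re.escape(part) for part in pattern.split("*")))

print(bool(regex.match("/doku.php?id=start&do=search")))  # True: blocked
print(bool(regex.match("/doku.php?id=start")))            # False: allowed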

Syntax

The information below comes from the original Robots Exclusion Protocol specification.

Disallow

An empty value for “Disallow” indicates that all URIs can be retrieved. At least one “Disallow” field must be present in the robots.txt file.

The “Disallow” field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,

Disallow: /help      # blocks both /help.html and /help/index.html
Disallow: /help/     # blocks /help/index.html but allows /help.html
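Since this is a plain prefix match, it can be sketched in a few lines (the blocked helper is hypothetical, not part of any library):

def blocked(path: str, rule: str) -> bool:
    # Disallow semantics from the specification: a plain prefix match
    return path.startswith(rule)

print(blocked("/help.html", "/help"))         # True
print(blocked("/help/index.html", "/help"))   # True
print(blocked("/help.html", "/help/"))        # False
print(blocked("/help/index.html", "/help/"))  # True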

Pattern

The pattern allows the following special characters and is matched against the URI path:

Character   Meaning
$           designates the end of the match pattern; the URI must end at this point for the rule to match
*           designates 0 or more instances of any character

# wildcard
Allow: /this/*/exactly
# end of URI
Allow: /this/path/exactly$
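Extending the earlier query-parameter sketch, a hypothetical rule_to_regex helper can honor both special characters:

import re

def rule_to_regex(rule: str) -> "re.Pattern":
    # '*' -> '.*'; a trailing '$' anchors the match at the end of the URI
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

r = rule_to_regex("/this/path/exactly$")
print(bool(r.match("/this/path/exactly")))       # True
print(bool(r.match("/this/path/exactly.html")))  # False: '$' forbids a suffix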

If the URI should match one of the above characters literally, the character must be percent-encoded in the pattern. For example:

Pattern                      URI blocked
/path/file-with-a-%2A.html   https://www.example.com/path/file-with-a-*.html
/path/foo-%24                https://www.example.com/path/foo-$
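As a quick check, these encodings are exactly what Python's urllib.parse.quote produces for the literal characters:

from urllib.parse import quote

print(quote("/path/file-with-a-*.html"))  # /path/file-with-a-%2A.html
print(quote("/path/foo-$"))               # /path/foo-%24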

Location

The robots.txt file is located at the root of the website. For instance: https://datacadamia.com/robots.txt
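A crawler can therefore derive the file's location from any page URL of the site, for example with urllib.parse.urljoin (the page URL below is illustrative):

from urllib.parse import urljoin

# Any page of the site resolves to the same root-level robots.txt
print(urljoin("https://datacadamia.com/web/robots/page", "/robots.txt"))
# https://datacadamia.com/robots.txt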

Test

You can test a robots.txt file with Google's robots.txt testing tool.
