About
robots.txt is a file that tells web robots (crawlers) which parts of your website they are allowed to crawl.
Example
Disallow all
User-agent: * # applies to all robots
Disallow: / # disallow crawling of all pages
Disallow a subdirectory
# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/
# Group 2
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
- Googlebot should not crawl http://www.example.com/nogooglebot/ or anything under that directory.
- All other user agents can access the entire site.
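The behavior of this file can be checked with Python's standard urllib.robotparser module (the example.com URLs are the ones from the sample above):

```python
from urllib import robotparser

# The sample robots.txt from above, parsed in memory
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot is blocked from the /nogooglebot/ subdirectory...
print(rp.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page.html"))  # False
# ...while any other user agent may fetch it
print(rp.can_fetch("Otherbot", "http://www.example.com/nogooglebot/page.html"))   # True
```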
Delay between page views
The crawl delay is the number of seconds the bot should wait between page views.
It does not represent a crawl rate.
Instead, it defines the size of a time window (from 1 to 30 seconds) during which the bot will crawl your website only once.
For example, with a crawl delay of 5 seconds, a bot will crawl at most 17,280 pages per day:
<MATH> 24 hours * 60 minutes * 60 seconds / 5 seconds = 17,280 pages </MATH>
Example:
User-agent: *
Disallow:
Allow: /*
Crawl-delay: 5
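A Python crawler can read this value via urllib.robotparser. A minimal sketch (the `Allow: /*` line is dropped here because robotparser follows the original specification and does not interpret `*` wildcards):

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow:
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("*")          # 5 seconds between page views
pages_per_day = 24 * 60 * 60 // delay
print(delay, pages_per_day)          # 5 17280
```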
Disallow a query search
You can disallow crawling of URLs whose query string contains a given parameter.
Example: disallow URLs with do=search in the query for all crawlers.
User-agent: *
Disallow: /*?*do=search*
Syntax
The information below comes from:
Disallow
An empty value for “Disallow” indicates that all URIs can be retrieved. At least one “Disallow” field must be present in the robots.txt file.
The “Disallow” field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
- to disallow both /help.html and /help/index.html:
Disallow: /help
- to disallow /help/index.html but allow /help.html:
Disallow: /help/
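The difference between the two rules can be verified with Python's urllib.robotparser:

```python
from urllib import robotparser

# "Disallow: /help" is a prefix match: it blocks /help.html AND /help/index.html
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
print(rp.can_fetch("*", "/help.html"))        # False
print(rp.can_fetch("*", "/help/index.html"))  # False

# "Disallow: /help/" only blocks the subdirectory: /help.html stays allowed
rp2 = robotparser.RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "/help.html"))        # True
print(rp2.can_fetch("*", "/help/index.html"))  # False
```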
Pattern
The pattern allows the following characters and is matched against:
- the path
- and the query string
Character | Description |
---|---|
$ | Designates the end of the match pattern; the URI must end where the pattern ends. |
* | Matches 0 or more instances of any character. |
# wildcard
Allow: /this/*/exactly
# end of uri
Allow: /this/path/exactly$
If the pattern should match one of the above characters literally, that character should be percent-encoded. For example:
Pattern | URI blocked |
---|---|
/path/file-with-a-%2A.html | https://www.example.com/path/file-with-a-*.html |
/path/foo-%24 | https://www.example.com/path/foo-$ |
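Python's urllib.robotparser does not interpret these wildcards, so here is a hypothetical robots_match helper (not part of any library) sketching the matching semantics, including the percent-encoding rule above:

```python
import re
from urllib.parse import unquote

def robots_match(pattern: str, path: str) -> bool:
    """Return True if a robots.txt pattern matches a URI path (+ query)."""
    # A trailing '$' anchors the pattern at the end of the URI
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Split on the wildcard FIRST, so that a percent-encoded '%2A'
    # decodes to a literal '*' instead of another wildcard
    parts = [re.escape(unquote(part)) for part in pattern.split("*")]
    regex = ".*".join(parts) + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(robots_match("/this/*/exactly", "/this/a/b/exactly"))                     # True
print(robots_match("/this/path/exactly$", "/this/path/exactly.html"))           # False
print(robots_match("/path/file-with-a-%2A.html", "/path/file-with-a-*.html"))   # True
```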
Location
The robots.txt file is located at the root of the host. For instance: https://datacadamia.com/robots.txt
Test
You can test a robots.txt file with Google's robots.txt testing tool.