robots.txt is a file that tells web bots which parts of your website they may crawl.
User-agent: * # applies to all robots
Disallow: / # disallow crawling of all pages
# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/
# Group 2
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
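As a quick way to see how a crawler resolves these groups, Python's standard-library urllib.robotparser can parse the rules above and answer per-bot access questions. This is a minimal sketch; the page URLs are made up for illustration:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /nogooglebot/",
    "",
    "User-agent: *",
    "Allow: /",
])
# Googlebot falls into group 1, so /nogooglebot/ is off limits
print(rp.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page"))  # False
# Every other bot falls into group 2, where everything is allowed
print(rp.can_fetch("OtherBot", "http://www.example.com/some/page"))          # True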
The crawl delay is the number of seconds the bot should wait between page views.
It does not represent a crawl rate; instead, it defines the size of a time window (from 1 to 30 seconds) during which the bot will crawl your website only once.
For example, with a crawl delay of 5 seconds, a bot will crawl at most 17,280 pages per day:
<MATH> \frac{24 \times 60 \times 60 \text{ seconds/day}}{5 \text{ seconds/page}} = 17{,}280 \text{ pages/day} </MATH>
Example:
User-agent: *
Disallow:
Allow: /*
Crawl-delay: 5
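As a sketch of both points, urllib.robotparser (Python 3.6 and later) can read back the declared delay, and the daily page budget follows from the formula above, assuming the bot honors the delay:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow:", "Allow: /*", "Crawl-delay: 5"])

delay = rp.crawl_delay("*")            # 5 seconds per time window
pages_per_day = 24 * 60 * 60 // delay  # 86,400 seconds / 5 = 17,280 pages
print(delay, pages_per_day)            # 5 17280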
You can disallow the crawling of URLs that contain a given query parameter.
Example: disallow the do=search parameter for all crawlers.
User-agent: *
Disallow: /*?*do=search*
The information below comes from the Robots Exclusion Protocol specification:
An empty value for “Disallow” indicates that all URIs can be retrieved. At least one “Disallow” field must be present in the robots.txt file.
The “Disallow” field specifies a partial URI that is not to be visited. This can be a full path or a partial path; any URI that starts with this value will not be retrieved. For example:
Disallow: /help
Disallow: /help/
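The prefix semantics matter: Disallow: /help blocks both /help.html and /help/index.html, whereas Disallow: /help/ blocks /help/index.html but still allows /help.html. A minimal check with urllib.robotparser (the host and bot name are illustrative):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help/"])
print(rp.can_fetch("MyBot", "http://www.example.com/help.html"))        # True
print(rp.can_fetch("MyBot", "http://www.example.com/help/index.html"))  # False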
A pattern may use the following special characters and is matched against the URI:
Character | Description
---|---
$ | Designates the end of the match pattern. The URI must end where the pattern ends for the rule to match.
* | Designates 0 or more instances of any character.
# wildcard: * matches any sequence of characters
Allow: /this/*/exactly
# end of URI: $ anchors the end of the match
Allow: /this/path/exactly$
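The standard-library parser used in the earlier sketches follows the original specification and does not handle these wildcards, so as a minimal sketch of the matching rules, a crawler could translate a pattern into a regular expression: each * becomes .*, a trailing $ anchors the end, and everything else (after percent-decoding, covered next) is matched literally. The helper names and test paths are illustrative:
import re
from urllib.parse import unquote

def pattern_to_regex(pattern):
    # A trailing $ anchors the match at the end of the URI
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Split on the * wildcard, then percent-decode and escape the literal parts
    parts = [re.escape(unquote(p)) for p in pattern.split("*")]
    return re.compile("^" + ".*".join(parts) + ("$" if anchored else ""))

def matches(pattern, path):
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/this/*/exactly", "/this/anything/exactly/more"))  # True
print(matches("/this/path/exactly$", "/this/path/exactly"))       # True
print(matches("/this/path/exactly$", "/this/path/exactly/more"))  # False
print(matches("/*?*do=search*", "/page?a=1&do=search"))           # True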
If the pattern needs to match one of the above characters literally, the character should be percent-encoded. For example:
Pattern | URI blocked
---|---
/path/file-with-a-%2A.html | https://www.example.com/path/file-with-a-*.html
/path/foo-%24 | https://www.example.com/path/foo-$
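A decoding sketch of the first row, using urllib.parse.unquote: once %2A is decoded and then escaped, the * is matched as a literal character rather than as a wildcard:
import re
from urllib.parse import unquote

decoded = unquote("/path/file-with-a-%2A.html")      # '/path/file-with-a-*.html'
rule = re.compile("^" + re.escape(decoded))          # the * is now literal
print(bool(rule.match("/path/file-with-a-*.html")))  # True
print(bool(rule.match("/path/file-with-a-x.html")))  # False: no wildcard behavior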
The robots.txt file is located at the root of the host. For instance: https://datacadamia.com/robots.txt
You can test a robots.txt file with Google's robots.txt testing tool.
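For a programmatic alternative, the same standard-library parser can fetch a live file from the root location and answer access questions; the user agent and page URL below are illustrative:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://datacadamia.com/robots.txt")  # robots.txt lives at the root
rp.read()                                         # fetch and parse the file over HTTP
print(rp.can_fetch("MyBot", "https://datacadamia.com/some/page"))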