Web - Robots.txt
About
robots.txt is a file that controls and gives permissions to web bots (crawlers) when they crawl your website.
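For illustration, here is a minimal sketch of how a well-behaved crawler typically consults robots.txt before fetching a page, using Python's standard urllib.robotparser module (the URL and bot name are placeholders):
from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder URL)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a given bot may crawl a given URL
if rp.can_fetch("MyBot", "https://www.example.com/some/page.html"):
    print("crawling allowed")
else:
    print("crawling disallowed")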
Example
Disallow all
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
Disallow a subdirectory
# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/
# Group 2
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
- Googlebot should not crawl http://www.example.com/nogooglebot/ or any of its subdirectories.
- All other user agents can access the entire site.
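These semantics can be checked locally with Python's urllib.robotparser (a sketch; the bot names are just examples):
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "/nogooglebot/page.html"))  # False: group 1 applies
print(rp.can_fetch("Googlebot", "/other/page.html"))        # True
print(rp.can_fetch("OtherBot", "/nogooglebot/page.html"))   # True: group 2 applies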
Delay between page views
The Crawl-delay directive specifies the number of seconds a bot should wait between page views. Note that this directive is non-standard and not every crawler honors it.
Example:
User-agent: *
Disallow:
Allow: /*
Crawl-delay: 5
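Python's urllib.robotparser exposes this value through crawl_delay() (available since Python 3.6), so a polite crawler can sleep between requests. A minimal sketch, assuming the rules above:
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("User-agent: *\nDisallow:\nAllow: /*\nCrawl-delay: 5".splitlines())

delay = rp.crawl_delay("MyBot") or 0    # 5 seconds for this file
for url in ["/page-1.html", "/page-2.html"]:
    if rp.can_fetch("MyBot", url):
        print("fetching", url)          # a real crawler would issue the request here
        time.sleep(delay)               # honor the Crawl-delay directive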
Disallow a search query
You can disallow the crawling of URLs that contain a given query-string parameter.
Example: disallow the do=search parameter for crawlers.
User-agent: *
Disallow: /*?*do=search*
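To illustrate, the rule above behaves like the following anchored regular expression (each * becomes .*), matched against the path plus the query string; this is a sketch of the matching logic, not any particular crawler's implementation:
import re

# Regex equivalent of the robots.txt pattern "/*?*do=search*"
blocked = re.compile(r"^/.*\?.*do=search.*")

print(bool(blocked.match("/forum?page=2&do=search")))  # True  -> crawling disallowed
print(bool(blocked.match("/forum?page=2")))            # False -> crawling allowed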
Syntax
The information below comes from the Robots Exclusion Protocol specification.
Disallow
The “Disallow” field specifies a partial URI that is not to be visited. This can be a full path or a partial path; any URI that starts with this value will not be retrieved. An empty value for “Disallow” indicates that all URIs can be retrieved. At least one “Disallow” field must be present in the robots.txt file. For example:
- to disallow both /help.html and /help/index.html:
Disallow: /help
- to disallow /help/index.html but allow /help.html:
Disallow: /help/
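The difference between the two forms can be verified with Python's urllib.robotparser, which applies the same prefix matching (a quick sketch):
from urllib import robotparser

# "Disallow: /help" blocks every URI that starts with /help
rp1 = robotparser.RobotFileParser()
rp1.parse("User-agent: *\nDisallow: /help".splitlines())
print(rp1.can_fetch("AnyBot", "/help.html"))        # False
print(rp1.can_fetch("AnyBot", "/help/index.html"))  # False

# "Disallow: /help/" only blocks URIs under the /help/ directory
rp2 = robotparser.RobotFileParser()
rp2.parse("User-agent: *\nDisallow: /help/".splitlines())
print(rp2.can_fetch("AnyBot", "/help.html"))        # True
print(rp2.can_fetch("AnyBot", "/help/index.html"))  # False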
Pattern
The pattern allows the following characters and is matched against:
- the path
- and the query string
Character | Meaning
---|---
$ | Designates the end of the match pattern: the URI must end at this point for the rule to match.
* | Designates 0 or more instances of any character.
# wildcard
Allow: /this/*/exactly
# end of URI
Allow: /this/path/exactly$
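Here is a minimal, hand-rolled matcher for these two metacharacters, for illustration only (note that Python's standard urllib.robotparser follows the original specification and does not implement wildcards):
import re

def rule_to_regex(pattern):
    """Translate a robots.txt match pattern into an anchored regex:
    '*' matches any run of characters, a trailing '$' pins the match
    to the end of the URI, everything else matches literally."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(rule_to_regex("/this/*/exactly").match("/this/sub/exactly")))        # True
print(bool(rule_to_regex("/this/path/exactly$").match("/this/path/exactly")))   # True
print(bool(rule_to_regex("/this/path/exactly$").match("/this/path/exactly/")))  # False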
If the pattern must match one of the above characters literally, that character should be percent-encoded. For example:
Pattern | URI blocked
---|---
/path/file-with-a-%2A.html | https://www.example.com/path/file-with-a-*.html
/path/foo-%24 | https://www.example.com/path/foo-$
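Python's urllib.parse.quote can produce these percent-encoded forms, as a quick check:
from urllib.parse import quote

print(quote("*", safe=""))  # %2A
print(quote("$", safe=""))  # %24
print("/path/file-with-a-" + quote("*", safe="") + ".html")  # /path/file-with-a-%2A.html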
Location
The robots.txt file is located at the root of the site. For instance: https://datacadamia.com/robots.txt
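A small helper, for illustration, that derives the robots.txt location from any URL on a site:
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL at the root of the page's host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/some/deep/page?x=1"))
# https://www.example.com/robots.txt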
Test
You can test a robots.txt file with Google's robots.txt testing tool.