User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
Disallow a subdirectory
# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Group 2
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
- Googlebot should not crawl http://www.example.com/nogooglebot/ or any of its subdirectories.
- All other user agents can access the entire site (both rules are checked in the sketch below).
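A minimal sketch that checks both rules with Python's standard urllib.robotparser; the bot name OtherBot and the page URLs are illustrative assumptions, not part of the example above.

from urllib.robotparser import RobotFileParser

# The two groups from the example above, fed to Python's built-in parser.
robots_txt = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot is blocked from the /nogooglebot/ subdirectory ...
print(parser.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page.html"))  # False
# ... while every other user agent, and every other path, stays allowed.
print(parser.can_fetch("OtherBot", "http://www.example.com/nogooglebot/page.html"))   # True
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))             # True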
Delay between page views
The crawl delay is the number of seconds the bot should wait between page views.
It does not represent a crawl rate.
Instead, it defines the size of a time window (from 1 to 30 seconds) during which the bot will crawl your website only once.
For example, with a crawl delay of 5 seconds, at most 17,280 pages can be crawled per day:
<MATH>\frac{24 \times 60 \times 60 \text{ seconds per day}}{5 \text{ seconds per page}} = 17,280 \text{ pages per day}</MATH>
User-agent: *
Disallow:
Allow: /*
Crawl-delay: 5
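A minimal sketch of that arithmetic, reading the delay back with Python's standard urllib.robotparser (Crawl-delay is supported there since Python 3.6); the bot name SomeBot is an illustrative assumption.

from urllib.robotparser import RobotFileParser

# The group above, fed to Python's built-in parser.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow:", "Allow: /*", "Crawl-delay: 5"])

delay = parser.crawl_delay("SomeBot")   # 5 seconds, the delay for every user agent
seconds_per_day = 24 * 60 * 60          # 86,400 seconds in a day
print(seconds_per_day // delay)         # 17280 pages at most per day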
Disallow a search query
You can disallow URLs that contain a given query-string parameter.
Example: disallow the do=search parameter for all crawlers.
User-agent: *
Disallow: /*?*do=search*
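robots.txt has no regular-expression syntax of its own, but the effect of this pattern can be approximated with one; a rough sketch, where the sample paths are assumptions.

import re

# '*' in the robots.txt pattern behaves roughly like '.*' in a regex.
blocked = re.compile(r"/.*\?.*do=search")

print(bool(blocked.match("/forum?do=search&q=robots")))  # True  -> disallowed
print(bool(blocked.match("/forum?do=show")))             # False -> still allowed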
An empty value for “Disallow” indicates that all URIs can be retrieved. At least one “Disallow” field must be present in the robots.txt file.
The “Disallow” field specifies a partial URI that is not to be visited. This can be a full path or a partial path; any URI that starts with this value will not be retrieved. For example:
- Disallow: /help disallows both /help.html and /help/index.html
- Disallow: /help/ disallows /help/index.html but allows /help.html (both cases are checked in the sketch below)
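The prefix rule can be expressed in a single line of code; a minimal sketch using the /help paths from the example above.

def blocked(path: str, disallow_value: str) -> bool:
    """A URI is blocked when it starts with the Disallow value;
    an empty Disallow value blocks nothing."""
    return bool(disallow_value) and path.startswith(disallow_value)

print(blocked("/help.html", "/help"))         # True  -> disallowed
print(blocked("/help/index.html", "/help"))   # True  -> disallowed
print(blocked("/help.html", "/help/"))        # False -> still allowed
print(blocked("/help/index.html", "/help/"))  # True  -> disallowed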
The pattern may contain the following special characters and is matched against the URI path:
|Character|Description|
|$|Designates the end of the match pattern: the URI must end at the point where the $ appears.|
|*|Designates 0 or more instances of any character.|
# wildcard
allow: /this/*/exactly
# end of uri
allow: /this/path/exactly$
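As a rough sketch (not an official parser implementation), the two wildcards can be interpreted by translating the pattern into a regular expression; the test URLs below are assumptions.

import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression:
    '*' becomes '.*', and a trailing '$' anchors the end of the URI."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

print(bool(pattern_to_regex("/this/*/exactly").match("/this/sub/exactly/more")))       # True
print(bool(pattern_to_regex("/this/path/exactly$").match("/this/path/exactly")))       # True
print(bool(pattern_to_regex("/this/path/exactly$").match("/this/path/exactly/more")))  # False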
If a URI contains one of the above characters literally (for example a $ or a * that is really part of the path), the character should be percent-encoded in the pattern.
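Python's standard urllib.parse.quote shows the encoded forms; a sketch, in which the sample path /price$list is an assumption.

from urllib.parse import quote

# Percent-encoded forms of the two special characters.
print(quote("*"))              # %2A
print(quote("$"))              # %24

# A path that really contains a '$' would therefore be written in a pattern as:
print(quote("/price$list"))    # /price%24list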
You can test a robots.txt file with Google's robots.txt testing tool.