Preventing your site from being crawled and indexed with robots.txt

I guess the first question you may ask is:

  • Why would you NOT want Google to index your website?

Obviously there are many good marketing and commercial reasons why you would want your website indexed by Google. In fact, most tech-savvy businesses are forever battling to improve their 'Google rank' so they appear at the top of as many search phrases as possible.

However, there are also plenty of sane reasons why you may not want some or all of a given site indexed, including:

  • A new website or application under development that you don't yet want indexed
  • Pages containing dynamic content that changes frequently and gains nothing from being indexed or cached
  • Other content you simply don't want appearing on Google and other search engines (not to be confused with security through obscurity)

A robots.txt file in your web hosting account can be used to ask Web Robots (also known as Web Wanderers, Crawlers, or Spiders), which are employed by all major search engines, not to crawl your website.

Robots are often used by search engines such as Google to categorize and archive websites, or by webmasters to proofread source code. Marketers can also use them to scan for email addresses, phone numbers or postal addresses to build contact lists.

Before crawling a website, most reputable robots will first check the robots.txt file to determine what they should and shouldn't index. The file must be placed in the top-level directory of your webserver.
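For the crawler's side of this handshake, Python's standard library includes a robots.txt parser in urllib.robotparser. The sketch below is purely illustrative; the example.com URLs and the 'ExampleBot' user agent are made-up placeholders.

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt, which lives at the web root.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# A well-behaved robot asks before fetching each page.
page = "https://example.com/reports/2024.html"
if rp.can_fetch("ExampleBot", page):
    print("robots.txt allows crawling", page)
else:
    print("robots.txt asks us not to crawl", page)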

An example robots.txt file:

User-agent: *
Disallow: /

The User-agent: * rule means this applies to all robots, and the Disallow: / rule tells the robot that it should not visit any pages on the site.

If you only wanted to disallow Google, for example, you could use:

User-agent: Googlebot
Disallow: /
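
If you want to make the intent explicit, you can pair the Googlebot record with a catch-all record whose Disallow value is left empty; an empty Disallow means nothing is blocked for those robots:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: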

You can also disallow specific directories or pages in this file. For example:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This will exclude all robots from those parts of the site only; everything not explicitly disallowed can still be retrieved.
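
If you want to sanity-check rules like these before publishing them, the same urllib.robotparser module can evaluate them in memory. The paths and the 'ExampleBot' user agent below are illustrative only.

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

# Parse the rules without fetching anything over the network.
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("ExampleBot", "/cgi-bin/search.cgi"))  # False: explicitly disallowed
print(rp.can_fetch("ExampleBot", "/about.html"))          # True: not listed, so allowed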

Each subdomain needs its own robots.txt file, e.g. a robots.txt file for example.com will not apply to new.example.com.
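
In other words, each host serves its own copy of the file from its own web root:

https://example.com/robots.txt      (applies only to example.com)
https://new.example.com/robots.txt  (applies only to new.example.com)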

Please note that robots.txt is a publicly available file, which means that anyone can view it. You should therefore never rely on it to hide sensitive content; listing a private path in robots.txt only advertises its location.