Robots.txt

A robots.txt file, when present in the root directory, indicates those areas of your site which should not be accessed or indexed by automated site crawlers (also called spiders) such as those used by search engines.

While spiders are supposed to follow the instructions contained in the robots.txt file, none are compelled to do so. Major search engines usually follow its instructions. Scores of other spiders, such as those used by spammers to collect email addresses, do not.

Robots.txt File Discussions

Searching for robots.txt on Google will return scores of results. It is a regularly discussed topic on many discussion forums, including our own.

Opinions vary from something short and to the point to endless lists of disallows. There are three points to keep in mind:

  • An improperly written robots.txt file can do more harm than good, disallowing the indexing of content you’d like to see in a search engine.
  • Because the robots.txt file is itself publicly accessible, it provides a roadmap to the very content you might want to keep private. Never try to hide sensitive material with a robots.txt file: any human visitor will have ready access to it (see the example after this list).
  • Scores of spiders ignore robots.txt files altogether. These include those used by spammers, but not only: spiders used by desktop applications may ignore it as well, for instance to speed up browsing or to let their users search within bookmarks.
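
For example, a robots.txt file along the following lines does nothing to protect the directory it names; it merely advertises it (the /private-reports/ path is purely illustrative):

User-agent: *
Disallow: /private-reports/

Anyone who opens the robots.txt file on your site can read that line and visit the directory directly.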

Recommended robots.txt file

In the Semiologic forum, Denis de Bernardy gave the following recommendation:

User-agent: *
Disallow: /wp-*
Allow: /wp-content/uploads

The above will strip out every file and folder whose name starts with wp-, with an exception for the site’s uploads folder.
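
To illustrate, and assuming a typical WordPress install, a spider that honors these rules would treat example URLs roughly as follows (the paths are purely illustrative; note that Allow and the * wildcard are extensions honored by major search engines rather than part of the original robots.txt specification):

/wp-admin/                       blocked (matches wp-*)
/wp-includes/js/script.js        blocked (matches wp-*)
/wp-content/uploads/photo.jpg    allowed (the Allow rule takes precedence)
/about/                          allowed (matches neither rule)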

It will suffice for nearly all users. And we firmly believe users should focus more on creating content than on trying to generate the perfect robots.txt file.

Power users who wish to exclude ill-behaving spiders should do so using an .htaccess file instead: ill-behaving spiders cannot ignore an .htaccess file, whereas they can use a robots.txt file to locate content.
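
As a minimal sketch, and assuming Apache with mod_rewrite enabled, rules along these lines turn specific spiders away with a 403 error. BadBot and EvilScraper are placeholder user-agent strings, not the names of actual spiders:

# Return 403 Forbidden to requests whose user agent matches the placeholder patterns
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EvilScraper [NC]
RewriteRule .* - [F,L]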

Google Webmaster Tools has instructions for creating a robots.txt file and an excellent explanation of the terminology involved. And if you choose to ignore our own recommendation, Google’s certainly qualifies as a worthy second opinion.

Creating a robots.txt

Nothing could be simpler in Semiologic Pro: browse to Settings / Robots.txt and start editing. It defaults to our recommended file.

Your site’s sitemap will automatically be appended to it if the XML Sitemaps plugin is active.
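
When it is, the file gains a line along these lines; example.com stands in for your own domain, and the exact file name depends on the plugin’s settings:

Sitemap: http://example.com/sitemap.xml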