How to Control Access of the Web Crawlers or Web Robots to Your Site

There are several reasons to control how web robots (web crawlers) access your site. As much as you want Googlebot to visit your site, you don’t want spam bots to come and collect private information from it. Every robot that crawls your site also uses your website’s bandwidth. This post explains how you can control robot access to your site with a simple ‘robots.txt’ file.

What are web robots or web spiders?

Web robots (also known as bots, web spiders, web crawlers, or ants) are programs that traverse the World Wide Web in an automated manner. Search engines (like Google and Yahoo) use web crawlers to index web pages so that their search results stay up to date.

Why use a ‘robots.txt’ file?

Googlebot may be crawling your site to provide better search results, but at the same time spam bots may be collecting personal information such as email addresses for spamming purposes. If you want to control the access of web crawlers to your site, you can do so with a ‘robots.txt’ file. Keep in mind that the file is a request, not an enforcement mechanism: well-behaved crawlers honor it, but a bot can choose to ignore it.

How do I create a ‘robots.txt’ file?

‘robots.txt’ is a plain text file. Use any text editor to create the ‘robots.txt’ file.

‘robots.txt’ file format

Each entry (rule) in the robots.txt file is a ‘field’ and ‘value’ pair, written one per line:
<field>:<value>

A simple robots.txt file uses the following three fields:

User-agent: the web robot that the rules below it apply to.
Disallow: the URL path you want to block the robot from accessing.
Allow: the URL path you want to allow the robot to access.
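To make the ‘field’ and ‘value’ format concrete, here is a minimal Python sketch that splits robots.txt text into pairs. The function name parse_robots_lines and the sample text are made up for illustration, and the sketch assumes that anything after a ‘#’ is a comment. Real crawlers do more than this (for example, grouping rules by User-agent).

```python
def parse_robots_lines(text):
    """Split robots.txt text into (field, value) pairs.

    Illustrative helper only: blank lines, comments and lines
    without a colon are skipped.
    """
    pairs = []
    for raw_line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Field names are compared case-insensitively, so normalise them.
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


# Sample content made up for this sketch.
sample = """\
# keep robots out of the private area
User-agent: *
Disallow: /private
"""

pairs = parse_robots_lines(sample)
print(pairs)
```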

Examples

The following will stop all robots from crawling your site (‘*’ means all robots and ‘/’ is the root directory).

User-agent: *
Disallow: /

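The following will stop all robots from crawling the ‘/private’ directory. The value is matched as a path prefix, so the rule also covers any other path that begins with ‘/private’.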

User-agent: *
Disallow: /private
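If you want to check a rule like this before uploading it, one option is Python's standard urllib.robotparser module. The sketch below feeds the rule to the parser and asks whether a few URLs may be fetched. The bot name ExampleBot and the URLs are placeholders, not real crawlers or pages.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" and the URLs below are placeholders for illustration.
private_page = parser.can_fetch(
    "ExampleBot", "http://www.yoursite.com/private/notes.html")
similar_prefix = parser.can_fetch(
    "ExampleBot", "http://www.yoursite.com/private-area/index.html")
public_page = parser.can_fetch(
    "ExampleBot", "http://www.yoursite.com/blog/post.html")

# Both paths that start with /private are blocked; the blog page is not.
print(private_page, similar_prefix, public_page)
```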

The following stops Google’s image crawler (Googlebot-Image) from indexing your images for Google image search. Use this to save bandwidth if you don’t want your images to be available in Google image search. Read the Reduce Bandwidth Usage post to learn more.

User-agent: Googlebot-Image
Disallow: /

The following will block all robots from crawling your site except Googlebot. Each group of rules starts with its own User-agent line, and a blank line separates the groups.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
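As a rough check of the last example, the sketch below uses Python's standard urllib.robotparser module to ask what two different robots may fetch. SomeOtherBot and the URL are placeholders. Python's parser is only one interpretation of the rules, so treat the result as a sanity check rather than a guarantee of how every crawler behaves.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URL for illustration.
url = "http://www.yoursite.com/index.html"

# The Googlebot group applies to Googlebot; every other robot
# falls back to the '*' group, which disallows everything.
googlebot_allowed = parser.can_fetch("Googlebot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(googlebot_allowed, other_allowed)
```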

Where to put the robots.txt file?

Put the robots.txt file in the root directory of your website. For example, the file should be reachable at www.yoursite.com/robots.txt, not in a sub-directory such as www.yoursite.com/sub-directory/robots.txt. In most cases the root is the “public_html” directory of your site.
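Robots look for the file at a fixed location: the path /robots.txt on the same host as the page they want to crawl. A small Python sketch of that lookup is shown here. The helper name robots_url is made up for illustration, and the page URL is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url.

    Illustrative helper: keeps the scheme and host, and replaces
    the path, query and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder page inside a sub-directory.
location = robots_url("http://www.yoursite.com/sub-directory/page.html")
print(location)
```

Whatever page the robot starts from, the result points at the root of the site, which is why a robots.txt file placed in a sub-directory is never found.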

You can verify that a bot visiting your site is really Googlebot by following the instructions on this page.
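The usual approach is a reverse DNS lookup on the visitor's IP address, a check that the host name belongs to googlebot.com or google.com, and then a forward lookup to confirm the host name maps back to the same IP. The Python sketch below follows that idea under some assumptions: the function name is_googlebot is made up, the accepted domain suffixes may not be a complete list, and the resolvers are passed in as parameters so the demo can run with stub values instead of real DNS queries.

```python
import socket

# Assumption: these suffixes cover the host names Googlebot resolves to.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes a reverse-then-forward DNS check."""
    # Default to real DNS lookups when no resolver is supplied.
    reverse_lookup = reverse_lookup or (
        lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (
        lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The host name must resolve back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        # Lookup failures are treated as "not verified".
        return False


# Stub resolvers with made-up values, so the demo needs no network.
def fake_reverse(addr):
    return "crawl-66-249-66-1.googlebot.com"


def fake_forward(host):
    return ["66.249.66.1"]


verified = is_googlebot("66.249.66.1", fake_reverse, fake_forward)
print(verified)
```

In real use you would call is_googlebot with just the IP address taken from your server log, which makes it perform actual DNS lookups.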


Comments (20 responses)

  1. admin says:

    If a bot doesn’t respect the directives in robots.txt, then you can’t really do anything about it through that file. Your option would be to block that bot by some other means, such as an .htaccess restriction.

  2. Harshan R says:

    What if the ‘spam bot’ doesn’t look for robots.txt and just does what it is designed to do – crawl away?

  3. Nani says:

    Ever since I added a robots.txt file to my site, I’ve noticed my pages get indexed almost instantly after publishing them. Probably because I added a line to show the google bots where my sitemap is

  4. Neha says:

    Nicely explained article. When new bloggers start their site, they are unaware of many important things like robots. Your tip helped me with optimization.

  5. Fergus says:

    Just went through about ten sites trying to learn about this. Yours was the first article that was well written and you understand and remember what someone needs to know when they are reading an article like this. Better than Wikipedia. Thanks for the link.

  6. Anna Hettick says:

    thank you for such an easy to understand article! I have no idea about coding and such and this told just what I needed to know!!

  7. Great tips. The only thing I would add is that it might be a good idea to block images in robots.txt. The traffic from images is crap anyway and it’s unnecessary traffic to your website.

  8. Adam says:

    Thanks for teaching me how to prevent Google from seeing my website in these moments when I’m clumsily trying to install a CMS on it… and it is mess, wouldn’t want it to be indexed like that.

  9. needed this for a thanksgiving 2011 website I am creating right now. thanks a lot

  10. easter says:

    thanks a lot for this tip. i was looking for that for my easter website. this article had everything i needed to know about how to control the access of robots to your site. cheers

  11. Togrul says:

    Thanks for sharing, again.

    Cheers,
    Togrul

  12. John Gamings says:

    Nice article. Ever since I added a robots.txt file to my site, I’ve noticed my pages get indexed almost instantly after publishing them. Probably because I added a line to show the google bots where my sitemap is

  13. Jim says:

    This is a good start for what is an extremely important and complicated process for perfecting the effectiveness of any SEO efforts you are putting in to your site. Of course it just gets trickier from here, but having this much under your belt will give even the most inexperienced webmaster a leg-up.

  14. I must say that this information was really necessary for me. First of all, I just started a new website and to say the truth I have added some of my information into it and after reading this article I was so much worried about whether my privacy would get compromised. I am also a newbie and hence I really did panic. Thanks to you guys, I do have some confidence now and I have made the text file with those lyrics like values!

  15. Tip: also use a robots.txt for test environments and temporary sites like domain.com/temporary/ and so on. Spiders might also crawl those directories and you don’t want them to be indexed.

  16. Thanks, I agree with you that robots.txt helps with the crawling of your pages.
    But Disallow has its benefits too: if you have a private page, or you are promoting a product and want to keep its download page hidden, Disallow can help.

  17. As a freelance webdeveloper, I’m always taking care of the little details. The same goes for using robots.txt. I always put in, even when bots are allowed to crawl everywhere.

    Why? Because a lot of bots and spiders are looking for it all the time and return a 404 message when they can’t find it. Therefore, I always include it in the root directory of the websites. It saves a lot of unnecessary traffic.

  18. I discovered your homepage by coincidence.
    Very interesting posts and well written.
    I will put your site on my blogroll.
    :-)
