How to control the access of web crawlers and web robots to your site

Categories: Web development

There are several reasons to control the access of web robots (web crawlers) to your site. As much as you want Googlebot to come to your site, you don’t want spam bots to come and collect private information from it. Not to mention that a robot uses your website’s bandwidth when it crawls the site! In this post I explain how you can control the access of web robots to your site with a simple ‘robots.txt’ file.

What are web robots or web spiders?

Web robots (also known as bots, web spiders, web crawlers or ants) are programs that traverse the World Wide Web in an automated manner. Search engines (like Google, Yahoo, etc.) use web crawlers to index web pages so that their search results stay up to date.

Why use a ‘robots.txt’ file?

Googlebot may be crawling your site to provide better search results, but at the same time spam bots may be collecting personal information such as email addresses for spamming purposes. If you want to control the access of web crawlers to your site, you can do so with a “robots.txt” file. Keep in mind that the file is only a request: well-behaved robots such as Googlebot honor it, but a badly behaved bot can simply ignore it.

How do I create a ‘robots.txt’ file?

‘robots.txt’ is a plain text file. Use any text editor to create the ‘robots.txt’ file.

‘robots.txt’ file format

The entries (rules) in the robots.txt file are written as ‘field’ and ‘value’ pairs, one pair per line:
<field>: <value>

A simple robots.txt file uses the following three fields:

User-agent: the web robot the following rules apply to.
Disallow: the URL path you want to block the robot from accessing.
Allow: the URL path you want to allow the robot to access.
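To make the ‘field’ and ‘value’ idea concrete, here is a minimal Python sketch that splits the text of a robots.txt file into pairs. The function name parse_rules and the sample rules (including the ‘/private/readme.html’ path) are made up for illustration, and this is not how any particular robot parses the file.

```python
def parse_rules(text):
    """Split robots.txt text into (field, value) pairs.

    Blank lines, comments (everything after '#') and lines without
    a colon are skipped. Field names are lower-cased because they
    are not case sensitive.
    """
    pairs = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


# Hypothetical sample file, for illustration only.
sample = """\
User-agent: *
Disallow: /private
Allow: /private/readme.html
"""

for field, value in parse_rules(sample):
    print(field, "->", value)
```

Each printed line shows one rule: the field on the left and its value on the right.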

Examples

The following will stop all robots from crawling your site (‘*’ means all robots and ‘/’ is the root directory):

User-agent: *
Disallow: /

The following will stop all robots from crawling the ‘/private’ directory. The rule matches every URL path that begins with ‘/private’.

User-agent: *
Disallow: /private
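If you want to check a rule like this before uploading it, Python’s standard library ships a robots.txt parser (urllib.robotparser). The sketch below feeds it the rule above and asks about a few paths. The bot name ‘ExampleBot’, the helper name allowed and the page paths are made up for the test, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) agent may fetch the given path."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)


print(allowed("/index.html"))          # True: no rule matches
print(allowed("/private/notes.html"))  # False: inside /private
print(allowed("/private-area.html"))   # False: the path also starts with /private
```

The last line shows why the prefix matters: write ‘Disallow: /private/’ with a trailing slash if you only want to block the directory itself.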

The following stops Google’s image crawler (Googlebot-Image) from indexing your images for Google image search. Use this to save bandwidth if you don’t want your images to be available in Google image search. Read the Reduce Bandwidth Usage post to learn more.

User-agent: Googlebot-Image
Disallow: /

The following will block all robots from crawling your site except Googlebot. Each group of rules starts with its own ‘User-agent’ line, and a blank line separates the groups.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
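Here is a quick way to confirm that the two groups behave as described, again using Python’s standard urllib.robotparser. ‘ExampleBot’ stands in for any robot other than Googlebot and is a made-up name. Treat this as a rough check rather than a guarantee of how every crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "http://www.yoursite.com/index.html"

# Googlebot matches its own group, so the Allow rule applies.
print(parser.can_fetch("Googlebot", page))   # True

# Any other robot falls back to the '*' group and is blocked.
print(parser.can_fetch("ExampleBot", page))  # False
```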

Where to put the robots.txt file?

Put the robots.txt file in the root directory of your website, so that it is reachable at www.yoursite.com/robots.txt. Robots do not look for it in a sub-directory such as www.yoursite.com/sub-directory. In most cases the root will be the “public_html” directory of your site.
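The rule of thumb is that a robot takes the address of any page on your site, keeps only the scheme and host name, and asks for ‘/robots.txt’ there. A small Python sketch of that lookup, where the function name robots_url and the example page address are made up:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address that covers the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yoursite.com/sub-directory/page.html"))
# http://www.yoursite.com/robots.txt
```

Whatever sub-directory the page lives in, the result always points at the root of the site.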

You can verify that a bot visiting your site is really Googlebot by following the instructions on this page.
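The usual way to do this check, as far as I know, is a reverse DNS lookup on the visitor’s IP address, a check that the host name ends in googlebot.com or google.com, and then a forward DNS lookup to confirm that the host name maps back to the same IP. Below is a rough Python sketch of those steps. The function name is_googlebot and the two optional lookup parameters (handy for testing without a network) are my own, and the list of suffixes is an assumption that you should compare against Google’s current instructions.

```python
import socket

# Assumed host name suffixes; check Google's instructions for the current list.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if the IP passes the reverse and forward DNS check."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex only returns IPv4 addresses.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)              # step 1: IP -> host name
        if not host.endswith(GOOGLE_SUFFIXES):
            return False                       # step 2: wrong domain
        return ip in forward_lookup(host)      # step 3: host name -> same IP?
    except OSError:
        # DNS lookup failed, so the visitor cannot be verified.
        return False
```

You would call it with an IP address taken from your access log, for example is_googlebot("203.0.113.5"), where the address here is only a placeholder.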


12 Responses.

  • #1 by Adam on December 28, 2011 - 12:39 pm

    Thanks for teaching me how to prevent Google from seeing my website in these moments when I’m clumsily trying to install a CMS on it… and it is a mess, I wouldn’t want it to be indexed like that.

  • #2 by thanksgiving 2011 on August 29, 2011 - 11:37 am

    needed this for a thanksgiving 2011 website I am creating right now. thanks a lot

  • #3 by easter on July 29, 2011 - 4:55 pm

    thanks a lot for this tip. i was looking for that for my easter website. this article had everything i needed to know about how to control the access of robots to your site. cheers

  • #4 by Togrul on January 18, 2011 - 9:27 am

    Thanks for sharing, again.

    Cheers,
    Togrul

  • #5 by John Gamings on January 3, 2011 - 11:07 am

    Nice article. Ever since I added a robots.txt file to my site, I’ve noticed my pages get indexed almost instantly after publishing them. Probably because I added a line to show the google bots where my sitemap is

  • #6 by Jim on December 14, 2010 - 10:37 pm

    This is a good start for what is an extremely important and complicated process for perfecting the effectiveness of any SEO efforts you are putting in to your site. Of course it just gets trickier from here, but having this much under your belt will give even the most inexperienced webmaster a leg-up.

  • #7 by Rapid Prototyping on August 19, 2010 - 1:27 am

    I must say that this information was really necessary for me. First of all, I just started a new website and to say the truth I have added some of my information into it and after reading this article I was so much worried about whether my privacy would get compromised. I am also a newbie and hence I really did panic. Thanks to you guys, I do have some confidence now and I have made the text file with those lyrics like values!

  • #8 by website laten maken on August 10, 2010 - 4:49 pm

    Tip: also use a robots.txt for test environments and temporary sites like domain.com/temporary/ and so on. Spiders might crawl those directories too, and you don’t want them to be indexed.

  • #9 by Adam@How To Make Money Online on July 26, 2010 - 10:05 am

    Thanks, I agree with you that robots.txt helps robots to crawl your pages.
    But the disallow rule has a benefit too: if you have a private page, or you are promoting a product and want to keep its download page hidden, disallow can help.

  • #10 by Webdesign Roosendaal on March 4, 2010 - 12:35 pm

    As a freelance webdeveloper, I’m always taking care of the little details. The same goes for using robots.txt. I always put in, even when bots are allowed to crawl everywhere.

    Why? Because a lot of bots and spiders are looking for it all the time and return a 404 message when they can’t find it. Therefore, I always include it in the root directory of the websites. It saves a lot of unnecessary traffic.

  • #11 by Aaron Wakling on November 12, 2008 - 1:15 am

    I discovered your homepage by coincidence.
    Very interesting posts and well written.
    I will put your site on my blogroll.
    :-)
