How to control access of the web crawlers or web robots to your site

Last updated: April 12, 2013

There are several reasons to control which web robots (web crawlers) can access your site. As much as you want Googlebot to come to your site, you don’t want spam bots to come and collect private information from it. Not to mention that a robot uses your website’s bandwidth when it crawls your pages! In this post I explain how you can control the access of web robots to your site with a simple ‘robots.txt’ file.

What are web robots or web spiders?

Web robots (also known as bots, web spiders, web crawlers, or ants) are programs that traverse the World Wide Web in an automated manner. Search engines (like Google, Yahoo, etc.) use web crawlers to index web pages so that their search results stay up to date.

Why use a ‘robots.txt’ file?

Googlebot may be crawling your site to provide better search results, but at the same time spam bots may be collecting personal information, such as email addresses, for spamming purposes. If you want to control the access of web crawlers to your site, you can do so with the “robots.txt” file. Keep in mind that the file is only a request: well-behaved crawlers honor it, but a badly behaved bot can simply ignore it.

How do I create a ‘robots.txt’ file?

‘robots.txt’ is a plain text file. Use any text editor to create the ‘robots.txt’ file.

‘robots.txt’ file format

The entries (rules) in the robots.txt file are written as ‘field’ and ‘value’ pairs, one pair per line.
<field>:<value>

A simple robots.txt file uses the following three fields:

User-agent: the web robot that the rules below it apply to.
Disallow: the URL path you want to block the robot from accessing.
Allow: the URL path you want to allow the robot to access.
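
To make the ‘field’ and ‘value’ format concrete, here is a minimal Python sketch that splits the text of a robots.txt file into pairs. The function names are made up for this illustration, and the handling of ‘#’ comments is an assumption that goes beyond what this post covers.

```python
def parse_robots_line(line):
    """Split one robots.txt line into a (field, value) pair, or return None."""
    # Assumption: anything after '#' is a comment and is ignored.
    line = line.split("#", 1)[0].strip()
    if ":" not in line:
        return None  # blank line or something that is not a rule
    field, value = line.split(":", 1)
    # Field names are compared in lower case; values keep their case.
    return field.strip().lower(), value.strip()


def parse_robots(text):
    """Return the list of (field, value) pairs found in a robots.txt body."""
    pairs = []
    for line in text.splitlines():
        pair = parse_robots_line(line)
        if pair is not None:
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    print(parse_robots("User-agent: *\nDisallow: /private\n"))
    # [('user-agent', '*'), ('disallow', '/private')]
```

This only shows how a line breaks down into a field and a value. It does not group the rules by robot or decide whether a URL is allowed.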

Examples

The following will stop all robots from crawling your site (‘*’ means all robots and ‘/’ is the root directory).
User-agent: *
Disallow: /

The following will stop all robots from crawling the ‘/private’ directory.
User-agent: *
Disallow: /private

The following stops Googlebot-Image from indexing your images for Google image search. Use this to save bandwidth if you don’t want your images to be available in Google image search. Read the Reduce Bandwidth Usage post to learn more.
User-agent: Googlebot-Image
Disallow: /

The following will block all robots from crawling your site except Googlebot.
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
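
If you want to check rules like the ones above before uploading them, one option is Python’s standard urllib.robotparser module. The sketch below is only an illustration: the helper is_allowed and the robot name ‘SomeBot’ are made up for the example, and Python’s parser may not match every real crawler’s behavior exactly.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules, user_agent, url):
    """Return True if the robots.txt text in `rules` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Two of the examples above, written as Python strings.
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private\n"
ONLY_GOOGLEBOT = "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /\n"

if __name__ == "__main__":
    site = "http://www.yoursite.com"
    # 'SomeBot' is a hypothetical robot name used only for this test.
    print(is_allowed(BLOCK_PRIVATE, "SomeBot", site + "/private/notes.html"))  # False
    print(is_allowed(BLOCK_PRIVATE, "SomeBot", site + "/index.html"))          # True
    print(is_allowed(ONLY_GOOGLEBOT, "Googlebot", site + "/index.html"))       # True
    print(is_allowed(ONLY_GOOGLEBOT, "SomeBot", site + "/index.html"))         # False
```

A check like this tells you what a polite robot would do with your rules. It says nothing about robots that never read the file.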

Where to put the robots.txt file?

Put the robots.txt file in the root directory of your website, so that it is reachable at www.yoursite.com/robots.txt and not in a sub-directory like www.yoursite.com/sub-directory/robots.txt. In most cases the root will be the “public_html” directory of your site.
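
Robots look for the file at the root of the site no matter which page they are about to crawl. As a small illustration, this Python sketch works out where the robots.txt for any page would be (robots_url is a made-up helper name, and www.yoursite.com is the placeholder site from above).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that `page_url` belongs to."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.yoursite.com/sub-directory/page.html"))
    # http://www.yoursite.com/robots.txt
```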

You can verify that a bot visiting your site is really Googlebot by following the instructions on this page.
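
The linked instructions are not reproduced here. As a rough sketch, a widely used way to do this check is a reverse DNS lookup on the visitor’s IP address, a test that the resulting host name ends in googlebot.com or google.com, and then a forward lookup to confirm that the name points back to the same IP. The function names and the domain list below are assumptions made for this example, and the forward lookup used here only handles IPv4 addresses.

```python
import socket

# Assumption: genuine Googlebot host names end in one of these domains.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """Return True if `hostname` falls under one of the assumed crawler domains."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS)


def is_real_googlebot(ip_address):
    """Check an IPv4 address with a reverse lookup followed by a forward lookup."""
    try:
        # Reverse lookup: IP address -> host name.
        hostname = socket.gethostbyaddr(ip_address)[0]
        if not is_google_hostname(hostname):
            return False
        # Forward lookup: host name -> list of IPv4 addresses.
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        # A failed DNS lookup is treated as "not verified".
        return False
    return ip_address in forward_ips
```

You would call is_real_googlebot with the IP address taken from your server’s access log. Both lookups need a working DNS connection, so the result can vary from one network to another.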

Comments (18 responses)

admin says:

April 12, 2013 at 8:09 pm

If a bot doesn’t respect the directives in the robots.txt then you can’t really do anything about it. Your option would be to block that bot by some other means, like an .htaccess restriction.
Harshan R says:

April 12, 2013 at 7:27 am

What if the ‘spam bot’ doesn’t look for robots.txt and just does what it is designed to do – crawl away?
Nani says:

July 17, 2012 at 3:18 pm

Ever since I added a robots.txt file to my site, I’ve noticed my pages get indexed almost instantly after publishing them. Probably because I added a line to show the google bots where my sitemap is
Neha says:

June 2, 2012 at 10:48 am

Nicely explained article.When new bloggers start their site , they are unaware of many important things like robots.Your tip helped me in doing optimization.
Fergus says:

April 29, 2012 at 2:59 pm

Just went through about ten sites trying to learn about this. Yours was the first article that was well written and you understand and remember what someone needs to know when they are reading an article like this. Better than Wikipedia. Thanks for the link.
Anna Hettick says:

April 6, 2012 at 10:37 am

thank you for such an easy to understand article! I have no idea about coding and such and this told just what I needed to know!!
Robert Hawkins says:

March 14, 2012 at 5:14 pm

Great tips. The only thing I would ad is that it might be a good idea to block images in robots.txt. The traffic from images is crap anyway and it’s unnecessary traffic to you website.
Adam says:

December 28, 2011 at 12:39 pm

Thanks for teaching me how to prevent Google from seeing my website in these moments when I’m clumsily trying to install a CMS on it… and it is mess, wouldn’t want it to be indexed like that.
thanksgiving 2011 says:

August 29, 2011 at 11:37 am

needed this for a thanksgiving 2011 website I am creating right now. thanks a lot
easter says:

July 29, 2011 at 4:55 pm

thanks a lot for this tip. i was looking for that for my easter website. this article had everything i needed to know about how to control the access of robots to your site. cheers
Togrul says:

January 18, 2011 at 9:27 am

Thanks for sharing, again.

Cheers,
Togrul
John Gamings says:

January 3, 2011 at 11:07 am

Nice article. Ever since I added a robots.txt file to my site, I’ve noticed my pages get indexed almost instantly after publishing them. Probably because I added a line to show the google bots where my sitemap is
Jim says:

December 14, 2010 at 10:37 pm

This is a good start for what is an extremely important and complicated process for perfecting the effectiveness of any SEO efforts you are putting in to your site. Of course it just gets trickier from here, but having this much under your belt will give even the most inexperienced webmaster a leg-up.
Rapid Prototyping says:

August 19, 2010 at 1:27 am

I must say that this information was really necessary for me. First of all, I just started a new website and to say the truth I have added some of my information into it and after reading this article I was so much worried about whether my privacy would get compromised. I am also a newbie and hence I really did panic. Thanks to you guys, I do have some confidence now and I have made the text file with those lyrics like values!
website laten maken says:

August 10, 2010 at 4:49 pm

Tip: also use a robots.txt for test environments and temporary sites like domain.com/temporary/ and stuff. Spiders might also crawl that directories and you don’t want them to be indexed.
Adam@How To Make Money Online says:

July 26, 2010 at 10:05 am

Thanks ,i agree with you that robots text helping to crawl your pages.
But the disallow have benefit too if you have private page or you are promoting product and you want to keep your download page of this product hidden ,this disallow can help.
Webdesign Roosendaal says:

March 4, 2010 at 12:35 pm

As a freelance webdeveloper, I’m always taking care of the little details. The same goes for using robots.txt. I always put in, even when bots are allowed to crawl everywhere.

Why? Because a lot of bots and spiders are looking for it all the time and return a 404 message when they can’t find it. Therefore, I always include it in the root directory of the websites. It saves a lot of unnecessary traffic.
Aaron Wakling says:

November 12, 2008 at 1:15 am

I discovered your homepage by coincidence.
Very interesting posts and well written.
I will put your site on my blogroll.
🙂
