Controlling bots

Time to Read: 5 minutes
Difficulty Level: Beginner
Tools Needed: access to your hosting control panel (Plesk)
Last Updated:

What are bots?

A web robot, usually abbreviated to a “bot”, is a program that automatically performs a specific function on the web, such as search engine indexing or HTML and link validation. Major search engines like Google and Bing run their own bots.

Bandwidth is a measure of the data sent across the internet. Each time a person visits your website, bandwidth is used. The same applies to bots: each time a bot visits your site, it uses up bandwidth and server resources.

Ordinarily the bandwidth that bots use is relatively small, but bots can sometimes consume gigabytes of bandwidth, which can be a problem for stores that already have high levels of traffic.

Whilst your website being visited regularly by bots is by no means a bad thing (and can in fact be perfectly normal if you are regularly adding new content to your website), problems arise when bots crawl your site too aggressively. It’s also possible for bots to become stuck in infinite loops on your website. These can be caused by custom scripts, but are most often caused when session IDs are appended to each URL that is indexed. The constant activity of web robots trapped in an infinite loop on your site is what can sometimes cause such heavy bandwidth usage.
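The Magento robots.txt reproduced later in this article tackles the session ID problem directly. As a minimal sketch, a rule like the following asks compliant bots not to crawl any URL carrying a SID query parameter (the wildcard matching used here is an extension honoured by major crawlers such as Googlebot and Bingbot, rather than part of the original robots.txt standard):

User-agent: *
Disallow: /*?SID=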

Using a robots.txt file to control bots
Magento 1 Robots.txt
Exceptions
Further reading

Using a robots.txt file to control bots

A robots.txt file is a simple text file that sits in your web root folder and can be used to control:

1) the bots that are allowed to visit your website

2) the pages that bots are allowed to access

3) the rate at which bots are permitted to crawl

One advantage of this is that you can prevent certain directories from being indexed by bots, which can in turn provide SEO benefits by preventing the indexing of duplicated content. A robots.txt file also allows you to specify a Crawl-delay value to stop a bot from constantly indexing and crawling your website, helping to reduce the footprint bots make on your bandwidth allocation and server resources.
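As a simple sketch, the robots.txt below asks every compliant bot to wait ten seconds between requests and to stay out of a /private/ directory (both the delay value and the directory name are placeholders; adapt them to your own site):

User-agent: *
Crawl-delay: 10
Disallow: /private/

Note that Crawl-delay is a non-standard extension: bots that honour it, such as Bingbot, interpret the value as the number of seconds to wait between requests, while others, including Googlebot, ignore it entirely (see the Exceptions section below).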

The robots.txt file can (and should!) be customised to your own site’s needs. It will also need to be placed within your site’s top-level directory. This will usually, but not always, be /httpdocs. If you’re using Magento 2 and have locked the web root down to your /pub directory, then the file should be placed into /pub. Bear in mind that if your store is served from a subdirectory of your domain rather than from the domain root, you will need to modify the paths in your robots.txt file accordingly. For instance, you would need to change “Disallow: /app/” to “Disallow: /your-subdirectory/app/”.
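For example, assuming a store served from the hypothetical path /your-subdirectory/, the directory rules from the file below would become:

User-agent: *
Disallow: /your-subdirectory/app/
Disallow: /your-subdirectory/includes/
Disallow: /your-subdirectory/var/

The same prefix would need to be applied to every other Disallow and Allow path in the file.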

We have included a widely available robots.txt file for use with Magento, which should help you to improve your site’s SEO as well as reduce bandwidth usage and server load:

Magento 1 Robots.txt

Please find below our recommended robots.txt for a standard Magento 1 store. We did not write this ourselves and claim no credit for it; the file is widely available elsewhere on the internet.

# $Id: robots.txt,v magento-specific 2010/28/01 18:24:19 goba Exp $
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these robots where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:  http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html
# Sitemap: http://www.mywebsite.com/sitemap.xml

# Crawlers Setup
User-agent: *
Crawl-delay: 30
# Allowable Index
Allow: /*?p=
Allow: /index.php/blog/
Allow: /catalog/seo_sitemap/category/
Allow: /catalogsearch/result/
# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /js/
Disallow: /lib/
Disallow: /magento/
Disallow: /media/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
Disallow: /skin/
Disallow: /stats/
Disallow: /var/
# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
# Paths (no clean URLs)
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?p=*&
Disallow: /*?SID= 
Disallow: /*?limit=all

# Uncomment if you do not wish for Google to index your images
#User-agent: Googlebot-Image
#Disallow: / 
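If your store has an XML sitemap, it’s also worth uncommenting the Sitemap line near the top of the file so that bots can find the sitemap directly (the www.mywebsite.com address is a placeholder; replace it with your own domain):

Sitemap: http://www.mywebsite.com/sitemap.xml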

Exceptions

There are a few exceptions to the above rules. The robots.txt directives will only take effect if a bot reads the file and obeys it. Most friendly bots will do just that, but bots designed for malicious purposes are unlikely to play by the rules.
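If a particular well-behaved bot is causing problems, you can also exclude it entirely with its own User-agent block. The example below uses AhrefsBot purely as an illustration; substitute the user agent string of whichever bot you want to keep out (and remember that a bot which ignores robots.txt will ignore this too):

User-agent: AhrefsBot
Disallow: /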

In addition, Baiduspider (Baidu’s crawler) and Googlebot will ignore any Crawl-delay directive in robots.txt. To make these bots crawl your site less often, you’ll need to sign up for a free webmaster tools account with each search engine. Google have a guide on how to do this. Semrush have a guide to navigating Baidu’s webmaster tools settings here. Baidu is a Chinese search engine, so your browser’s translation features will be a necessity here!

Lastly, Bingbot will obey the Crawl-delay directive in your robots.txt, but Bing’s webmaster tools also allow you to set a custom crawl pattern for your site instead. For example, you could use this to tell Bing to crawl the site more aggressively at times when you expect less traffic (for most sites this will be overnight, though this assumes that your primary market is in the UK timezone).

Further reading

More information about controlling bots using a robots.txt file is available from the official robots.txt website here, and a comprehensive guide to using robots.txt to improve your site’s SEO is available here.