The robots.txt file contains instructions that tell search engine robots what they may do on a particular website. While search engine robots generally follow the instructions in that file, spam bots ignore them in most cases.
A web robot is a program that reads the content of web pages. Before a robot crawls a website, it first checks the robots.txt file for instructions. The "Disallow" directive, for example, tells the robot not to visit a given set of pages on the site. Web administrators use this file to keep bots from indexing parts of a website for various reasons: they do not want the content to be accessible to other users, the website is under construction, or a certain part of the content must be hidden from the public.
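To see what this looks like in practice, here is a minimal sketch of a well-behaved robot consulting robots.txt before crawling, written in Python with the standard urllib.robotparser module (the domain and user agent name are just placeholders):

from urllib import robotparser

# Fetch and parse the site's robots.txt file, the way a polite crawler would.
parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# can_fetch() applies the Allow/Disallow rules for the given user agent.
if parser.can_fetch("MyCrawler", "https://www.example.com/private/"):
    print("robots.txt allows this page")
else:
    print("robots.txt disallows this page - a polite robot skips it")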
While search engines such as Google use robots to index web content and can easily be instructed and restricted through robots.txt, spammers use spambots to harvest e-mail addresses, for example, and those bots do not follow the instructions in the robots.txt file. They look for and follow keywords that might be related to an e-mail address, such as "post", "message", "journal" and so on. What is typical of a spambot is that it comes from many IP addresses and identifies itself with many different user agents, which makes it very hard to block. Some spambots even use search engines such as Google to look for particular information on a web page.
Fortunately, there are still things that can be done to prevent spambots from scanning your website and stealing information. Neil Gunton came up with a Spambot Trap, which blocks spambots while still allowing the good search engine spiders to visit your website.
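The general idea behind such a trap can be sketched roughly as follows (a simplified illustration of the honeypot principle, not Neil Gunton's actual code): robots.txt disallows a trap directory, a hidden link on the site points into it, and any client that requests it anyway is by definition ignoring robots.txt, so its address gets recorded for blocking. The paths and file names below are hypothetical:

#!/usr/bin/env python3
# Simplified spambot-trap sketch: this CGI-style script sits behind a URL
# such as /trap/ that robots.txt disallows; whoever requests it is logged.
import os
from datetime import datetime, timezone

BLOCKLIST = "/var/www/data/blocked_ips.txt"  # hypothetical blocklist file

def record_visitor():
    ip = os.environ.get("REMOTE_ADDR", "unknown")         # set by the web server for CGI scripts
    agent = os.environ.get("HTTP_USER_AGENT", "unknown")
    stamp = datetime.now(timezone.utc).isoformat()
    with open(BLOCKLIST, "a") as f:
        f.write(f"{stamp} {ip} {agent}\n")
    # Minimal CGI response; a real trap would also feed this list into the
    # server's deny rules (for example .htaccess or a firewall).
    print("Content-Type: text/plain")
    print()
    print("Nothing to see here.")

if __name__ == "__main__":
    record_visitor()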
Still, if you want to leave instructions for the regular search engine bots about which pages should be indexed and which should not, be careful not to block the search engines completely. If you put in the wrong commands, your website will have no chance of showing up in search results at all. If you don't have a robots.txt file, web robots will simply index everything on your website.
Here’s a list of some ready-to-use basic commands for the robots.txt file (a quick way to test them is shown right after the list):
- Exclude a file from a certain search engine:
User-agent: Googlebot
Disallow: /private/privatefile.htm
- Exclude a section or page of your site from all web robots:
User-agent: *
Disallow: /newsection/
- Prevent all bots from indexing any part of your website:
User-agent: *
Disallow: /
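Before publishing a robots.txt file, it is worth checking that the rules do what you expect. Here is a small sketch using Python's urllib.robotparser, whose parse() method accepts the file's lines directly, so nothing has to be uploaded first (the tested paths are just examples):

from urllib import robotparser

rules = """\
User-agent: *
Disallow: /newsection/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot may fetch the home page, but not the excluded section.
print(parser.can_fetch("Googlebot", "/index.html"))            # True
print(parser.can_fetch("Googlebot", "/newsection/page.html"))  # False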
If you go through your server logs and spot a suspicious host, you can run it through our Blacklist Checker. It will tell you whether the domain or IP has been blacklisted. If it has, you can simply block that host from accessing your site again.
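If you are not sure which hosts to look at first, a rough count of requests per client can help. The sketch below assumes a common Apache/Nginx access-log layout in which the client address is the first field on every line; the log path is hypothetical:

from collections import Counter

hits = Counter()
with open("/var/log/apache2/access.log") as log:  # hypothetical log path
    for line in log:
        ip = line.split(" ", 1)[0]  # first field is the client address
        hits[ip] += 1

# Show the ten most active hosts - good candidates for a blacklist check.
for ip, count in hits.most_common(10):
    print(f"{count:6d}  {ip}")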