Search engine spiders and similar robots look for a robots.txt file in your main web directory. This is a plain text file. Create or modify it with a text editor, and be sure to upload (FTP) it in ASCII mode. The file is used to exclude robots from sections of your web site, so they won't read files in those areas.

1. What are these robots?

These are mostly automated programs that fetch content from many web sites for a variety of purposes. Search engines often call them spiders and send them out to look for pages to include in their search results. Some spammers also use this technology to harvest email addresses to send their junk mail to. Other uses include bots looking for illegal files or content.

2. How do I create a robots.txt file?

The syntax is very limited and easy to understand. The first part specifies the robot we are referring to.

    User-agent: BotName

Replace BotName with the name of the robot in question. To address all robots, simply use an asterisk.

    User-agent: *

The second part tells the robot in question not to enter certain parts of your web site.

    Disallow: /cgi-bin/

In this example, any path on our site starting with the string /cgi-bin/ is declared off limits. Multiple paths can be excluded per robot by using several Disallow lines.

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /temp/
    Disallow: /private

This robots.txt file applies to all bots and instructs them to stay out of the directories /cgi-bin/ and /temp/. It also tells them that any path/URL on your site starting with /private (files and directories alike) is off limits.

To declare your entire web site off limits to BotName, use the example shown below.

    User-agent: BotName
    Disallow: /

To have a generic robots.txt file which welcomes every robot and does not restrict them, use this sample.

    User-agent: *
    Disallow:

This beginner's tutorial includes a list of common robot names to get you started. Many others exist.
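To see how a well-behaved robot would read the rules quoted above, they can be fed to the robots.txt parser in Python's standard library. This is only a rough sketch: ExampleBot and the test paths are made-up names for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The multi-path sample from the tutorial text.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private
"""


def build_parser(rules_text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def is_allowed(parser, agent, path):
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    parser = build_parser(RULES)
    # "ExampleBot" and these paths are hypothetical.
    for path in ["/index.html", "/cgi-bin/form.pl", "/temp/a.txt",
                 "/private.html", "/private/notes.txt"]:
        verdict = "allowed" if is_allowed(parser, "ExampleBot", path) else "blocked"
        print(path, verdict)
```

Only /index.html comes back as allowed. Both /private.html and /private/notes.txt are blocked, because a Disallow value is matched as a plain prefix of the path.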
Some bots ignore robots.txt files, as they don't care whether you want them on your web site or not. These can be blocked by using a .htaccess file instead.

1. Block robots via .htaccess

We can't block by robot name here. Instead, we block robots by matching the beginning of their User-Agent string.

    SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
    SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
    SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
    SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
    SetEnvIfNoCase User-Agent "^Teleport" bad_bot
    SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
    <Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>

This example bans a list of spambots. To block another robot, add a line for it near the top.

    SetEnvIfNoCase User-Agent "^User-Agent" bad_bot

Replace the User-Agent inside the quotes (after the ^) with the User-Agent string for this robot, as found in your log files. Here's a sample log entry.

    xyz.net - - [07/Mar/2003:11:28:35] "GET / HTTP/1.0" 403 - "-" "Teleport 1.28"

Here, the User-Agent is Teleport 1.28. The ^ character in the SetEnvIfNoCase lines anchors the pattern to the start of the User-Agent string, so anything starting with the string we provide is blocked, and SetEnvIfNoCase ignores upper and lower case when matching. Any User-Agent starting directly with Teleport would be blocked, regardless of version number or added text.
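Before adding a new SetEnvIfNoCase line, it can help to check which User-Agent strings in your logs a pattern would catch. The sketch below imitates the anchored, case-insensitive match in Python. It is not how Apache evaluates the directive internally. The helper names are made up, and it assumes the User-Agent is the last double-quoted field on the log line, as in the sample entry above.

```python
import re

# Prefixes taken from the SetEnvIfNoCase lines in the example.
BAD_BOT_PREFIXES = [
    "EmailSiphon", "EmailWolf", "ExtractorPro", "CherryPicker",
    "NICErsPRO", "Teleport", "EmailCollector",
]

# One pattern, anchored with ^ and case-insensitive, like the directive.
BAD_BOT_RE = re.compile(
    "^(?:" + "|".join(re.escape(p) for p in BAD_BOT_PREFIXES) + ")",
    re.IGNORECASE,
)


def user_agent_from_log(line):
    """Return the last double-quoted field of a log line, or None.

    Assumption: the User-Agent is the final quoted field.
    """
    fields = re.findall(r'"([^"]*)"', line)
    return fields[-1] if fields else None


def is_bad_bot(user_agent):
    """True if the User-Agent starts with one of the listed prefixes."""
    return BAD_BOT_RE.search(user_agent) is not None


if __name__ == "__main__":
    sample = ('xyz.net - - [07/Mar/2003:11:28:35] "GET / HTTP/1.0" '
              '403 - "-" "Teleport 1.28"')
    agent = user_agent_from_log(sample)
    print(agent, is_bad_bot(agent))
```

Because the pattern is anchored, a string that merely contains Teleport somewhere in the middle is not matched. Only strings that begin with it are.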