Apache - Use .htaccess to hard-block spiders and crawlers

The .htaccess file is a (hidden) per-directory configuration file that Apache reads; its directives apply to the directory it sits in and to all subdirectories.

WARNING: Make a backup copy of the .htaccess file before editing it; a single misplaced character, even one dot or comma too many or too few, can render your site inaccessible.

One of the things you can do with .htaccess is block or redirect web requests coming from certain IP addresses or user agents.

The snippet below blocks excessively active crawlers and bots by matching a string in the User-Agent request header and redirecting matching requests to a static page. Apache still processes each request (it must, in order to read the .htaccess rules), but the request never reaches your site's actual pages or scripts. Note that the rule as written issues a redirect to a nocrawler.htm page rather than a literal "403 Forbidden" response.
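To make the matching behaviour concrete: each RewriteCond with the [NC] flag performs a case-insensitive, unanchored regex match against the User-Agent header, and the [OR] flags chain the conditions so that any single match triggers the rule. A minimal Python sketch of that logic (the bot list mirrors the patterns used in the rules below; the function name is illustrative, not part of Apache):

```python
import re

# Patterns mirroring the RewriteCond lines in the .htaccess snippet below.
BAD_BOTS = ["facebookexternalhit", "Twitterbot", "Baiduspider",
            "MetaURI", "mediawords", "FlipboardProxy"]

def is_blocked(user_agent: str) -> bool:
    # Each RewriteCond with [NC] is a case-insensitive, unanchored
    # regex search against the User-Agent header; [OR] chains them
    # so that matching any one pattern is enough.
    return any(re.search(pat, user_agent, re.IGNORECASE)
               for pat in BAD_BOTS)

print(is_blocked("Mozilla/5.0 (compatible; Baiduspider/2.0)"))   # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0")) # False
```

Because the match is unanchored, "Baiduspider" matches anywhere inside the header, which is why a short substring per bot is sufficient.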

Add the following lines to a website's .htaccess file:

.htaccess
# Redirect bad bots to one page
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Do not rewrite requests for the target page itself (avoids a rewrite loop)
RewriteCond %{REQUEST_URI} !/nocrawler\.htm
RewriteRule .* http://yoursite/nocrawler.htm [L]
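If you want a genuine hard block, as the page title suggests, mod_rewrite's [F] flag makes Apache answer with "403 Forbidden" directly instead of redirecting; the nocrawler.htm page and the REQUEST_URI exception then become unnecessary. A sketch using the same patterns:

```apache
# Answer 403 Forbidden directly instead of redirecting
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# "-" means no substitution; [F] returns 403 and implies [L]
RewriteRule .* - [F]
```

The "-" target tells mod_rewrite to leave the URL unchanged, and [F] short-circuits the request with a 403 status.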
apache/use_.htaccess_to_hard-block_spiders_and_crawlers.1476017554.txt.gz · Last modified: 2020/07/15 09:30 (external edit)
