Apache - Use .htaccess to hard-block spiders and crawlers

The .htaccess file is a hidden configuration file that can be placed in any directory of an Apache-served website.

WARNING: Make a backup copy of the .htaccess file first; a single dot or comma too many or too few can render your site inaccessible.

One of the things you can do with .htaccess is block or redirect web requests coming from certain IP addresses or user agents.
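For the IP-address case, blocking takes only a few lines. A minimal sketch, assuming Apache 2.4 or later (where mod_authz_core provides the Require directive) and using a placeholder address range:

.htaccess
# Allow everyone except a placeholder IP range (replace with the real one).
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
</RequireAll>

The rest of this page deals with blocking by user agent.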

This blocks excessively active crawlers/bots by matching a substring in the User-Agent request header and redirecting their web requests to a static page, before they ever reach your site's actual content. (A variant that answers with a hard “403 – Forbidden” instead is shown at the end of this page.)

Add the following lines to a website's .htaccess file:

.htaccess
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Do not rewrite requests for the block page itself (prevents a loop).
RewriteCond %{REQUEST_URI} !/nocrawler\.html
RewriteRule .* http://yoursite/nocrawler.html [L]

This catches the server-hogging spiders, bots, and crawlers by matching a substring of their user agent's name (case-insensitively, which is what the NC flag does). End each user-agent line with [NC,OR], except the last bot's line, which gets [NC] only.

This piece of code redirects the unwanted crawlers to a dummy HTML file, http://yoursite/nocrawler.html, in your site's root directory.
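To verify the rules, you can impersonate one of the blocked bots from the command line, for example with curl -A "Baiduspider" -I http://yoursite/ (the -A option sets the User-Agent header); the response should be a redirect to nocrawler.html instead of the normal page.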

An example could be:

nocrawler.html
<!DOCTYPE html>
<html>
<head><title>Blocked</title></head>
<body>
<p>This crawler was blocked</p>
</body>
</html>

NOTE: The line RewriteCond %{REQUEST_URI} !/nocrawler\.html is needed to avoid looping: without it, the redirected request for nocrawler.html would itself match the bot conditions and be redirected again, endlessly.
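If you would rather answer with a genuine “403 – Forbidden”, as mentioned in the introduction, mod_rewrite's F flag does exactly that. A minimal sketch with a shortened bot list (extend the conditions as above):

.htaccess
# Answer blocked bots with 403 Forbidden instead of a redirect.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
# The F flag returns 403 and implies L; no nocrawler.html page or
# loop-prevention condition is needed, as Apache serves its own error page.
RewriteRule .* - [F]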
