====== Apache - Use .htaccess to hard-block spiders and crawlers ======

The **.htaccess** is a (hidden) configuration file which can be placed in any directory of a website.
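
These .htaccess directives only take effect if the main Apache configuration permits overrides for the directory in question; a minimal sketch (the directory path is only an example):

<code apache>
# In the main server or vhost configuration (not in .htaccess).
<Directory "/var/www/example">
    # "All" is the simplest (if coarsest) setting; it lets .htaccess use
    # mod_rewrite, SetEnvIf and the access-control directives shown below.
    AllowOverride All
</Directory>
</code>
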
One of the things you can do with **.htaccess** is block unwanted spiders and crawlers from your site.

This blocks excessively active crawlers/bots that put unnecessary load on the server.

Add the following lines to a website's **.htaccess** file:

<file bash .htaccess>
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Do not touch requests for the dummy page itself (avoids a redirect loop).
RewriteCond %{REQUEST_URI} !\/nocrawler\.html
# www.example.com is a placeholder; use your own domain here.
RewriteRule .* http://www.example.com/nocrawler.html [L]
</file>

This catches the server-hogging spiders, bots and crawlers by a substring of their user-agent name (case insensitive).

This piece of code redirects the unwanted crawlers to a dummy HTML file, **nocrawler.html**, instead of the real content.

An example could be:

<file html nocrawler.html>
<html>
<head>
<title>No crawlers, please</title>
</head>
<body>
<p>This content is not available to crawlers.</p>
</body>
</html>
</file>
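
To check that the rule works, you can spoof one of the blocked user agents with curl and look at the response headers (a sketch; www.example.com stands in for your own domain):

<code bash>
# Pretend to be Baiduspider and request only the headers.
curl -I -A "Baiduspider" http://www.example.com/
# Expect a redirect with "Location: http://www.example.com/nocrawler.html".
</code>
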
===== Alternative Approach =====

The previous method redirected any request from the blocked spiders or crawlers to one page. That is the "friendly" way. However, if you get a lot of spider requests, it also means that your Apache server does double work: it handles the original request, which is redirected, and then a second request to deliver the **nocrawler.html** file.

So while it will keep bots, spiders and crawlers away from your content, it will not ease the pressure on your Apache server.
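
If you want to stay with mod_rewrite, the same conditions can answer with a 403 directly via the **[F]** flag instead of redirecting (a sketch, using the same user-agent list as above); the **SetEnvIf** approach below achieves the same without mod_rewrite:

<file bash .htaccess>
# Answer bad bots with "403 Forbidden" straight from mod_rewrite.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# [F] returns 403 Forbidden, [L] stops processing further rules.
RewriteRule .* - [F,L]
</file>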

A hard (and simple) way to block unwanted spiders, crawlers and other bots is to return a **"403 Forbidden"** response instead.

Add this code to your .htaccess:

<file bash .htaccess>
# Block bad bots with a 403.
# Same user-agent substrings as in the rewrite example above.
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
</file>
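
Note that **Order/Allow/Deny** is Apache 2.2 syntax (kept alive by mod_access_compat in 2.4). On Apache 2.4 the equivalent, reusing the same **bad_bot** environment variable, is expressed with **Require**:

<file bash .htaccess>
# Apache 2.4 equivalent using mod_authz_core.
<RequireAll>
    Require all granted
    Require not env=bad_bot
</RequireAll>
</file>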

===== Deny by IP Address =====

Block attempts from 123.234.11.* and 192.168.12.*.

<file bash .htaccess>
# Deny malicious crawler IP addresses (partial IPs match whole ranges).
deny from 123.234.11
deny from 192.168.12
</file>
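
On Apache 2.4, the same blocks can be written with **Require** (a sketch, using the same two ranges):

<file bash .htaccess>
# Apache 2.4 equivalent using mod_authz_core.
<RequireAll>
    Require all granted
    Require not ip 123.234.11
    Require not ip 192.168.12
</RequireAll>
</file>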