====== Apache - Use .htaccess to hard-block spiders and crawlers ======

The **.htaccess** file is a (hidden) file which can be found in any directory served by Apache.

One of the things you can do with **.htaccess** is hard-block unwanted spiders and crawlers.

This blocks excessively active crawlers/spiders by their user agent, before they can hog the server.

Add the following lines to a website's **.htaccess** file:

<file bash .htaccess>
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Do not match the dummy page itself, or the redirect would loop.
RewriteCond %{REQUEST_URI} !\/nocrawler\.html
# Replace the hostname below with your own domain.
RewriteRule .* http://yoursite.example/nocrawler.html [L]
</file>

This catches the server-hogging spiders, bots, and crawlers by a (case-insensitive) substring of their user agent's name.

This piece of code redirects the unwanted crawlers to a dummy HTML file, **nocrawler.html**, instead of the page they requested.

An example could be:

<file html nocrawler.html>
<html>
  <head>
    <title>No crawling!</title>
  </head>
  <body>
    <p>Crawlers are not welcome here.</p>
  </body>
</html>
</file>
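The case-insensitive substring matching done by the `[NC]` conditions above can be sanity-checked outside Apache. The following is a minimal sketch, assuming a POSIX shell; the `is_blocked` helper is hypothetical and only mimics the matching, it is not part of the .htaccess recipe:

```shell
# Mimic Apache's [NC] (case-insensitive) substring match against
# the blocked user agents from the .htaccess recipe above.
is_blocked() {
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *facebookexternalhit*|*twitterbot*|*baiduspider*|*metauri*|*mediawords*|*flipboardproxy*)
      return 0 ;;   # would be redirected
    *)
      return 1 ;;   # would be served normally
  esac
}

is_blocked "Mozilla/5.0 (compatible; Baiduspider/2.0)" && echo "blocked"
is_blocked "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Firefox/115.0" || echo "allowed"
```

A live rule can also be exercised with ''curl -A "Baiduspider" -I'' against the site, which should answer with a 302 redirect to **nocrawler.html**.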
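If you would rather refuse the bots outright than serve them a dummy page, the same `RewriteCond` lines can feed a forbidden rule instead. This is a sketch of an alternative, not part of the recipe above:

<file bash .htaccess>
# Same user-agent conditions as before, but answer 403 Forbidden:
# "-" means no substitution, F sends 403, L stops rewriting.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteRule .* - [F,L]
</file>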