====== Apache - Use .htaccess to hard-block spiders and crawlers ======

The **.htaccess** is a (hidden) configuration file which can be placed in any directory of a website.
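
These .htaccess directives only take effect if the main Apache configuration permits overrides for the directory in question; a minimal sketch (the directory path is only an example):

<code apache>
# In the main server or vhost configuration (not in .htaccess).
<Directory "/var/www/example">
    # "All" is the simplest (if coarsest) setting; it lets .htaccess use
    # mod_rewrite, SetEnvIf and the access-control directives shown below.
    AllowOverride All
</Directory>
</code>
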
One of the things you can do with **.htaccess** is block unwanted spiders and crawlers from your site.

This blocks excessively active crawlers/bots that put unnecessary load on the server.

Add the following lines to a website's **.htaccess** file:

<file bash .htaccess>
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Do not touch requests for the dummy page itself (avoids a redirect loop).
RewriteCond %{REQUEST_URI} !\/nocrawler\.html
# www.example.com is a placeholder; use your own domain here.
RewriteRule .* http://www.example.com/nocrawler.html [L]
</file>

This catches the server-hogging spiders, bots and crawlers by a substring of their user-agent name (case insensitive).

This piece of code redirects the unwanted crawlers to a dummy HTML file, **nocrawler.html**, instead of the real content.

An example could be:

<file html nocrawler.html>
<html>
<head>
<title>No crawlers, please</title>
</head>
<body>
<p>This content is not available to crawlers.</p>
</body>
</html>
</file>
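
To check that the rule works, you can spoof one of the blocked user agents with curl and look at the response headers (a sketch; www.example.com stands in for your own domain):

<code bash>
# Pretend to be Baiduspider and request only the headers.
curl -I -A "Baiduspider" http://www.example.com/
# Expect a redirect with "Location: http://www.example.com/nocrawler.html".
</code>
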
===== Alternative Approach =====

The previous method redirected any request from the blocked spiders or crawlers to one page. That is the "friendly" way. However, if you get a lot of spider requests, it also means that your Apache server does double work: it handles the original request, which is redirected, and then a second request to deliver the **nocrawler.html** file.

So while it will keep bots, spiders and crawlers away from your content, it will not ease the pressure on your Apache server.
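
If you want to stay with mod_rewrite, the same conditions can answer with a 403 directly via the **[F]** flag instead of redirecting (a sketch, using the same user-agent list as above); the **SetEnvIf** approach below achieves the same without mod_rewrite:

<file bash .htaccess>
# Answer bad bots with "403 Forbidden" straight from mod_rewrite.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# [F] returns 403 Forbidden, [L] stops processing further rules.
RewriteRule .* - [F,L]
</file>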

A hard (and simple) way to block unwanted spiders, crawlers and other bots is to return a **"403 Forbidden"** response instead.

Add this code to your .htaccess:

<file bash .htaccess>
# Block bad bots with a 403.
# Same user-agent substrings as in the rewrite example above.
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
</file>
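
Note that **Order/Allow/Deny** is Apache 2.2 syntax (kept alive by mod_access_compat in 2.4). On Apache 2.4 the equivalent, reusing the same **bad_bot** environment variable, is expressed with **Require**:

<file bash .htaccess>
# Apache 2.4 equivalent using mod_authz_core.
<RequireAll>
    Require all granted
    Require not env=bad_bot
</RequireAll>
</file>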

===== Deny by IP Address =====

Block attempts from 123.234.11.* and 192.168.12.*.

<file bash .htaccess>
# Deny malicious crawler IP addresses (partial IPs match whole ranges).
deny from 123.234.11
deny from 192.168.12
</file>
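
On Apache 2.4, the same blocks can be written with **Require** (a sketch, using the same two ranges):

<file bash .htaccess>
# Apache 2.4 equivalent using mod_authz_core.
<RequireAll>
    Require all granted
    Require not ip 123.234.11
    Require not ip 192.168.12
</RequireAll>
</file>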