How to build a Bot Trap and keep bad bots away from a web site

Block spam bots and other bad bots from accessing and scanning your web site

Many bad bots either carry well-known bad user-agent names or ignore the robots.txt standard (www.robotstxt.org/wc/exclusion.html). Admittedly, most modern bad bots do not identify themselves and hide behind faked UA strings, either other bots' names or user browser UA strings such as "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)", but some old, silly ones still do, for example "Internet Explore 5.x" or "Mozilla/3.0 (compatible; Indy Library)".
The Indy Library is originally a programming library, available at http://www.nevrona.com/Indy or http://indy.torry.net under an Open Source license, and included with Borland Delphi 6 and 7, C++Builder 6, and all Kylix versions. Unfortunately, this library has been hijacked and abused by some Chinese spam bots: all recent user-agents carrying the unmodified "Indy Library" string were of Chinese origin.

The following may help against rude, robots.txt-ignorant bots. However, there are smarter bad bots out there that need more sophisticated countermeasures.
How to catch those bots that ignore and violate the robots.txt standard?
Method 1 - using PHP
Trap them in a special /bot-trap directory:
Disclaimer: The following scripts are working for me. I share them just for your information. If you use them, you do so at your own risk. Be careful not to screw up your web site or web server.

    set up a special subdirectory /bot-trap (use a different name of your own choosing),
    put an exclude statement into your /robots.txt,

        User-agent: *
        Disallow: /bot-trap/

    include a hidden link (a transparent 1x1 pixel GIF) at the beginning of your main entrance page /

      <a href="/bot-trap/"><img src="images/pixel.gif" border="0" alt=" " width="1" height="1"></a>

    then watch your web server's log to see who gets trapped.
    Most human users won't ever see this link, and good bots (like googlebot, etc.) honor the robots.txt directives and won't visit your /bot-trap.
    Caveats: this method is not bullet-proof, and there may be collateral damage, so check the trapped addresses regularly and promptly, and research and unblock those that turn out to be special but innocent user tools rather than bots (a maintenance sketch for unblocking follows at the end of this method).
    In my own experience, about 1 or 2 out of 100 trapped hits are false positives.
    To prevent certain user-agents or IP address ranges from being trapped, you can add some sort of whitelisting (see the whitelist block at the top of the script below).
    Additionally, we have put a /bot-trap/index.php in place to notify the hostmaster by mail and automatically append the bot's IP address to an active blacklist file, blacklist.dat.
    For the first start, create an empty blacklist.dat file (the ../blacklist.dat seen from the /bot-trap/ directory, i.e. in your document root) and make it readable and writable by the web server. Here is the text of the /bot-trap/index.php:

     
        <html>
        <head><title> </title></head>
        <body>
        <p>There is nothing here to see. So what are you doing here ?</p>
        <p><a href="http://your.domain.tld/">Go home.</a></p>
        <?php
          /* whitelist: if the visitor matches, end processing and exit without blacklisting */
          if (preg_match("/10\.22\.33\.44/",$_SERVER['REMOTE_ADDR'])) { exit; }
          if (preg_match("/Super Tool/",$_SERVER['HTTP_USER_AGENT'])) { exit; }
          /* end of whitelist */
          $badbot = 0;
          /* scan the blacklist.dat file for addresses of SPAM robots
             to prevent filling it up with duplicates */
          $filename = "../blacklist.dat";
          $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
          while ($line = fgets($fp,255)) {
            $u = explode(" ",$line);
            $u0 = $u[0];
            if (preg_match("/$u0/",$_SERVER['REMOTE_ADDR'])) {$badbot++;}
          }
          fclose($fp);
          if ($badbot == 0) { /* we just see a new bad bot not yet listed ! */
          /* send a mail to hostmaster */
            $timestamp = time();
            $datum = date("Y-m-d (D) H:i:s",$timestamp);
            $from = "badbot-watch@domain.tld";
            $to = "hostmaster@domain.tld";
            $subject = "domain-tld alert: bad robot";
            $msg = "A bad robot hit $_SERVER['REQUEST_URI'] $datum \n";
            $msg .= "address is $_SERVER['REMOTE_ADDR'], agent is $_SERVER['HTTP_USER_AGENT']\n";
            mail($to, $subject, $msg, "From: $from");
          /* append bad bot address data to blacklist log file: */
            $fp = fopen($filename,'a+');
            fwrite($fp,"$_SERVER['REMOTE_ADDR'] - - [$datum] \"$_SERVER['REQUEST_METHOD'] $_SERVER['REQUEST_URI'] $_SERVER['SERVER_PROTOCOL']\" $_SERVER['HTTP_REFERER'] $_SERVER['HTTP_USER_AGENT']\n");
            fclose($fp);
          }
        ?>
        </body>
        </html>

    To exclude those known bad bots from further visits to my pages, all my active PHP pages begin with a check of the blacklist.dat file, by including this statement at the top:

        <?php include($_SERVER['DOCUMENT_ROOT'] . "/blacklist.php"); ?> 

    Here is the text of the blacklist.php include file to exclude the bad ones:

        <?php
        $badbot = 0;
        /* look for the requesting IP address in the blacklist file;
           an absolute path is used so the include works from any directory */
        $filename = $_SERVER['DOCUMENT_ROOT'] . "/blacklist.dat";
        $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
        while ($line = fgets($fp,255))  {
          $u = explode(" ",$line);
          $u0 = $u[0];
          if (preg_match("/$u0/",$_SERVER['REMOTE_ADDR'])) {$badbot++;}
        }
        fclose($fp);
        if ($badbot > 0) { /* this is a bad bot, reject it */
          sleep(12);
          print ("<html><head>\n");
          print ("<title>Site unavailable, sorry</title>\n");
          print ("</head><body>\n");
          print ("<center><h1>Welcome ...</h1></center>\n");
          print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
          print ("<p><center>If you feel this in error, send a mail to the hostmaster at this site,<br>
                 if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
          print ("</body></html>\n");
          exit;
        }
        ?>

    [Please don't try to locate and access my /bot-trap, as your IP address will be locked out immediately...].
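
    As mentioned in the caveats above, you should unblock accidentally trapped addresses again. The following is just a minimal maintenance sketch, not part of the original setup: the script name unblock.php, running it from the command line, and the blacklist path are my assumptions; adjust $filename so it matches the blacklist.dat your trap actually writes to.

        <?php
          /* unblock.php - remove one IP address from blacklist.dat again.
             Run from the command line:  php unblock.php 10.22.33.44
             Sketch only: adjust $filename to your real blacklist location. */
          $filename = "/path/to/your/blacklist.dat";
          if ($argc < 2) { die("usage: php unblock.php <ip-address>\n"); }
          $ip = $argv[1];
          $lines = file($filename) or die ("Error opening file ... \n");
          $kept = array();
          foreach ($lines as $line) {
            $u = explode(" ",$line);
            /* keep every entry whose first field is not the address to unblock */
            if (rtrim($u[0]) !== $ip) { $kept[] = $line; }
          }
          $fp = fopen($filename, "w") or die ("Error opening file for writing ... \n");
          fwrite($fp, implode("", $kept));
          fclose($fp);
          print("Removed $ip from the blacklist (if it was listed).\n");
        ?>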

Method 2 - using .htaccess

A different method from the PHP approach above (and one that works even better because it covers your entire site) is to enter bad bots into the .htaccess file as follows (if you have access to your .htaccess and Apache's configuration allows the following directives):

   SetEnvIfNoCase User-Agent "Indy Library" bad_bot
   SetEnvIfNoCase User-Agent "Internet Explore 5.x" bad_bot
   SetEnvIf Remote_Addr "195\.154\.174\.[0-9]+" bad_bot
   SetEnvIf Remote_Addr "211\.101\.[45]\.[0-9]+" bad_bot
   Order Allow,Deny
   Allow from all
   Deny from env=bad_bot
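
Note: Order/Allow/Deny is Apache 2.2 syntax; on Apache 2.4 it only works if mod_access_compat is loaded. If your server runs Apache 2.4 without that module, the native replacement for the last three lines should be the following (my addition, not part of the original recipe):

   <RequireAll>
      Require all granted
      Require not env=bad_bot
   </RequireAll>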

To automate this process, you may adapt the techniques from Method 1 above to write to your .htaccess file automatically out of your /bot-trap script (a sketch follows below). But be sure about what you are really doing: if you screw up your .htaccess, your site will be dead.
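
The following is only a rough sketch of such an automatic .htaccess update, based on my own assumptions rather than the original article: the .htaccess lives in the document root, is writable by the web server (a security trade-off in itself), and already contains the "Deny from env=bad_bot" block shown above. The fragment would go inside the if ($badbot == 0) block of /bot-trap/index.php; keep a backup of your .htaccess before experimenting with it.

    /* Sketch: append a SetEnvIf rule for the trapped address to the site's .htaccess.
       Place inside the  if ($badbot == 0) { ... }  block of /bot-trap/index.php. */
    $htaccess = $_SERVER['DOCUMENT_ROOT'] . "/.htaccess";
    $addr = $_SERVER['REMOTE_ADDR'];
    /* only ever write a plain dotted IPv4 address, nothing else */
    if (preg_match("/^\d{1,3}(\.\d{1,3}){3}$/", $addr)) {
      $fp = fopen($htaccess, 'a') or die ("Error opening .htaccess ... \n");
      /* escape the dots so SetEnvIf matches the address literally */
      fwrite($fp, 'SetEnvIf Remote_Addr "^' . str_replace('.', '\.', $addr) . '$" bad_bot' . "\n");
      fclose($fp);
    }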
If you have full access to your .htaccess, there are many more things you can do. You may read www.webmasterworld.com/forum23/1281.htm for more details.
 
List of bad bots

Here is a list of spiders and bots seen so far on kloth.net and kloth.org ignoring the robots.txt standard.

Comments and amendments are appreciated. Please let me know if there are other methods.
Send your comments to < hostmaster at kloth.net >
 