How to keep unwanted robots away from the websites on our server

botond · published Nov. 27, 2019


 

Introduction

As we run more and more websites on our server, traffic increases and so does the workload. A large part of this extra load is caused by robot traffic. The more sub-pages a website has, the more URLs the crawlers fetch automatically, which creates work for both the Apache web server and PHP, meaning unnecessary use of the hardware that runs them, not to mention the bandwidth consumed by the requested pages. Of course, not all of these robots are useless, but most of them just load the server for no good reason. In this tutorial we will look at two ways to keep these unnecessary robots away from our websites.

 

 

Recognizing and distinguishing robots

As already mentioned, not all robots are useless, but the vast majority of them are. "Robot etiquette" requires that when crawling a site, robots comply with the instructions in the robots.txt file located in the root directory of the website. That is, if a particular robot is disallowed in the file, or its crawl rate is limited, it should obey. However, most robots ignore this and visit our pages whether we want them to or not.
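
For reference, a minimal robots.txt sketch is shown below; the bot name and the crawl delay are only illustrative examples, and not every robot honours the Crawl-delay directive:

# Example robots.txt in the web root (illustrative values)
User-agent: *
Crawl-delay: 10

# Ask a specific robot to stay away entirely
User-agent: SemrushBot
Disallow: /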

The crawlers of the big search engines, such as Google, Microsoft Bing, or Yahoo, do take into account what is specified in the robots.txt file and crawl (or not) accordingly. These robots are also very useful, because they index our pages in their search engines. So we should definitely let the robots of these big search engines through if we know what is good for us. But what about the other camp, the ones that ignore our wishes?

View the Apache log file

The easiest way to recognize these robots is to look at the log file of Apache or Nginx, as appropriate. This tutorial covers Apache, so open the access.log file. On a LAMP system with a default Apache installation, this file is located at /var/log/apache2/access.log. Let's look at the last few lines:

cat /var/log/apache2/access.log | tail -5

My most recently installed Debian 10 (Buster) LAMP server has not seen activity for a while, so thanks to the logrotate log rotation system, I only found log entries in access.log.1:

View the Apache log file
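
If the entries we are interested in have already been rotated away, the older log files can be inspected too; a quick sketch assuming the default Debian paths and gzip-compressed older rotations:

tail -5 /var/log/apache2/access.log.1
zcat /var/log/apache2/access.log.2.gz | tail -5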

The location of the log file may be different on other servers. For example, in an ISPConfig server environment, i.e. one set up according to the perfect server configurations, it is:
/var/www/<website>/log/access.log

Picking out a line here, for example the one for a .png file fetched from phpMyAdmin, it looks like this:

192.168.1.100 - - [19/Nov/2019:15:02:58 +0100] "GET /phpmyadmin/themes/pmahomme/img/s_unlink.png HTTP/1.1" 200 873 "http://192.168.1.130/phpmyadmin/phpmyadmin.css.php?nocache=6286344183ltr&server=1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"

It includes the client's IP address, the date and time of the event, the request itself, the referrer, and finally the browser ID (user agent). This last piece of data is the important one for us, because it lets us identify who or what requested the file from the server.

Fortunately, most robots send their own browser ID, so we can recognize them and keep them away from our website(s).

For example, a typical line from Googlebot (which of course we should never ban!) looks like this on the server:

66.249.64.172 - - [26/Nov/2019:18:23:08 +0100] "GET /leirasok/web-hoszting/biztonsag/hogyan-kezeljuk-http-hitelesito-jelszavainkat-a-htpasswd-fajlban-a-htpasswd-parancs-segitsegevel HTTP/1.1" 200 25879 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

It has just recently fetched one of the earlier tutorials from the site. The robot's browser ID is at the end of the line.

Based on this, if we regularly monitor the Apache log file, we can collect the names of the robots.
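
For example, a quick way to collect candidates is to cut the browser ID (the last quoted field of the combined log format) out of each line and count the occurrences; a rough sketch assuming the default LAMP log path:

# List the 20 most frequent browser IDs in the log
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20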

Blacklisting

In the next step we compile a "blacklist" of the robots we don't want to let onto our pages. Here are a few that I have collected, which I will list:

  • SemrushBot
  • AhrefsBot
  • Mb2345Browser
  • MegaIndex.ru
  • MJ12bot
  • DotBot
  • Baiduspider
  • YandexBot
  • LieBaoFast
  • zh_CN
  • zh-CN
  • SeznamBot
  • trendictionbot
  • magpie-crawler

These are not complete browser IDs, just distinctive fragments of them that uniquely identify the robot in question. Of course there are many more, but the others do not appear frequently enough for me to have picked them out of today's and yesterday's log files.

The zh_CN and zh-CN identifiers do not always belong to robots; there are also browsers from China (with rather dubious names) that advertise their language region this way, but even those are not necessarily real visits. So far I have mostly seen these labels on robots. Decide for yourself whether, say, a Hungarian-language website needs Chinese traffic that consists mostly of robots. On an international site, where real Chinese visits are natural, you might leave them out of the list; in that case weigh the proportion of robots.

These names have to be stitched together with "|" characters (and escape characters where needed) and then inserted into the places where we want to do the blocking. We'll get to those later; for now we are just assembling the filter line.

So our stitched line looks like this:

SemrushBot|AhrefsBot|Mb2345Browser|MegaIndex\.ru|MJ12bot|DotBot|Baiduspider|YandexBot|LieBaoFast|zh_CN|zh-CN|SeznamBot|trendictionbot|magpie-crawler

Unfortunately, this cannot be broken into multiple lines; it has to stay a single line.

For items that contain a dot, make sure the dot is preceded by an escape character (\). Here only MegaIndex.ru contains a dot. Also pay attention to case sensitivity: the names must appear exactly as they do in the log file.
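
If the list grows, it is more convenient not to stitch the line together by hand. A small sketch, assuming the robot names are kept one per line in a hypothetical bots.txt file:

# Escape the dots, then join the lines with "|" into a single filter line
sed 's/\./\\./g' bots.txt | paste -sd'|'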

On my server these robots kept showing up. I had not dealt with them before because they generated insignificant traffic, but lately they have become more aggressive and have noticeably pushed up the CPU utilization of the server. So yesterday I shut them out.

Many sites deal with cataloguing browser IDs; here, for example, you can browse among the robots as well.

 

 

Blocking robots with the Fail2Ban protection system

Fail2Ban is a robust and highly effective protection system for Unix-like operating systems. It makes it easy to take up the fight against unwanted visitors. By analyzing various log files, the program dynamically manages the appropriate firewall rules based on the filter rules configured in it, so we can set up server-wide protection globally. Fail2Ban can protect all of the server's services, but here we focus only on the Apache web server.

Installation

If Fail2Ban is not yet installed on your server, install it with the following command (on Debian-based servers):

apt-get -y install fail2ban

After installation the program's default settings are fine, so we will not go into them now.

You can even start it with these default settings:

systemctl start fail2ban

Create your own filter

The next step is to create our own filter. To do this, as root, change into the /etc/fail2ban/filter.d directory:

cd /etc/fail2ban/filter.d

This directory contains the program's built-in filters, which you can switch on / off later. Now create a new filter file in this directory:

nano apache-mycustombots.conf

And put the following content into it, which already contains our filter line from above:

# Custom robot filter
[Definition]

mycustombots = SemrushBot|AhrefsBot|Mb2345Browser|MegaIndex\.ru|MJ12bot|DotBot|Baiduspider|YandexBot|LieBaoFast|zh_CN|zh-CN|SeznamBot|trendictionbot|magpie-crawler

failregex  = ^<HOST> .*(GET|POST|HEAD).*(%(mycustombots)s).*$

ignoreregex =

datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}
We could also break this up into a multi-line failregex filter, entering the robots separately or even one per line, but keeping the list in a single line makes the code easier to handle if, for example, we later need to change something in the regular expression. Save the file.
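
For comparison, the multi-line variant would look roughly like this for the first two robots (each indented line is a separate regular expression):

failregex = ^<HOST> .*(GET|POST|HEAD).*SemrushBot.*$
            ^<HOST> .*(GET|POST|HEAD).*AhrefsBot.*$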

Test your own filter

Before turning the filter on, let's do a dry run with our Apache log file and the filter to see whether it works and how many matches it finds in the log. To do this, run the fail2ban-regex command:

fail2ban-regex <path to the Apache access.log file> /etc/fail2ban/filter.d/apache-mycustombots.conf

The filter produced this output for one of my websites:

Running tests
=============

Use   failregex filter file : apache-mycustombots, basedir: /etc/fail2ban
Use      datepattern : Default Detectors
Use         log file : /var/www/xxxx.hu/log/access.log
Use         encoding : UTF-8


Results
=======

Failregex: 17836 total
|-  #) [# of hits] regular expression
|   1) [17836] ^<HOST> .*(GET|POST|HEAD).*(SemrushBot|AhrefsBot|Mb2345Browser|MegaIndex\.ru|MJ12bot|DotBot|Baiduspider|YandexBot|LieBaoFast|zh_CN|zh-CN|SeznamBot|trendictionbot|magpie-crawler).*$
`-

Ignoreregex: 0 total

Date template hits:
|- [# of hits] date format
|  [19797] ^[^\[]*\[(Day(?P<_sep>[-/])MON(?P=_sep)ExYear[ :]?24hour:Minute:Second(?:\.Microseconds)?(?: Zone offset)?)
`-

Lines: 19797 lines, 0 ignored, 17836 matched, 1961 missed
[processed in 3.48 sec]

Missed line(s): too many to print.  Use --print-all-missed to print all 1961 lines

The important part is the Failregex total: the filter found 17,836 robot matches in today's Apache log of this website. Of course, this does not mean that many page downloads; every small resource request shows up as a separate line in the log, as we saw above with the .png file fetched from phpMyAdmin. Even so, it is not bad for the middle of the day, especially considering that the "missed" lines below number only 1961, which corresponds to the non-robot traffic. So on this site the robots outnumber real visitors roughly ten to one. That is quite a waste, so let's turn on the filter.
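
The same ratio can also be estimated with plain grep, without Fail2Ban; a rough sketch using our filter line (replace the xxxx.hu path with your own site's log):

# Number of log lines matching the robot list, then the total number of lines
grep -cE 'SemrushBot|AhrefsBot|Mb2345Browser|MegaIndex\.ru|MJ12bot|DotBot|Baiduspider|YandexBot|LieBaoFast|zh_CN|zh-CN|SeznamBot|trendictionbot|magpie-crawler' /var/www/xxxx.hu/log/access.log
wc -l < /var/www/xxxx.hu/log/access.log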

Turn on the filter

To activate the filter, open the /etc/fail2ban/jail.local file:

nano /etc/fail2ban/jail.local

And let's include the following:

[apache-mycustombots]
enabled   = true
port      = http,https
filter    = apache-mycustombots
logpath   = /var/log/ispconfig/httpd/*/access.log
findtime  = 3600
maxretry  = 1
bantime   = 86400

The meaning of these settings was already discussed in an earlier tutorial, but let's go through them again here:

  • enabled: This turns the filter on.
  • port: The blocking is applied to these ports, which are predefined in Fail2Ban.
  • filter: The name of the filter file created earlier (without the extension).
  • logpath: The log file(s) to scan. In this case, the Apache access.log file.
    On a LAMP server, use the path mentioned at the beginning of this tutorial; with a custom setup, use the path of the access.log file you configured.
    Here I show an example for an ISPConfig server environment, where the "*" wildcard covers the Apache log files of all websites at once. Of course, you can also point it at the log file of a single website if needed. But this way we protect every website on the server, including ones created later.
  • findtime: The time window (in seconds) in which matches are counted. If it is set to 3600, only the last hour's worth of entries in the log file is considered.
  • maxretry: Originally this exists to limit repeated log events, for example failed password attempts against various server services; here it limits how often a given robot may show up. So set it to the number of hits you want to tolerate from a particular IP address; the next hit triggers the ban. A robot therefore always gets logged at least once before Fail2Ban blocks it.
  • bantime: The duration of the ban, in seconds. I set a round day here.

We can experiment with the last three parameters to achieve the effect we want, depending on how much traffic our websites get, and so on.

Once this is saved, restart Fail2Ban:

systemctl restart fail2ban

From here, the system nicely blocks the IP addresses of unwanted robots.
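
Out of curiosity, we can also peek at what Fail2Ban created in the firewall. With the default iptables-based ban action the jail gets its own chain named f2b-<jail name>; a sketch, assuming that default:

# Show the first few blocking rules of our jail's chain
iptables -L f2b-apache-mycustombots -n --line-numbers | head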

 

 

Checking

It is always a good idea to check the Fail2Ban service after activating a new filter:

systemctl status fail2ban

The output during normal operation:

 fail2ban.service - Fail2Ban Service
   Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-11-27 12:34:55 CET; 2h 21min ago
     Docs: man:fail2ban(1)
  Process: xxx ExecStartPre=/bin/mkdir -p /var/run/fail2ban (code=exited, status=0/SUCCESS)
 Main PID: xxx (fail2ban-server)
    Tasks: 21 (limit: 4915)
   Memory: 85.4M
   CGroup: /system.slice/fail2ban.service
           └─xxx /usr/bin/python3 /usr/bin/fail2ban-server -xf start

nov 27 12:34:55 szerver systemd[1]: Starting Fail2Ban Service...
nov 27 12:34:55 szerver systemd[1]: Started Fail2Ban Service.
nov 27 12:34:57 szerver fail2ban-server[27518]: Server ready

In addition, we can check our new jail separately with the fail2ban-client command:

fail2ban-client status apache-mycustombots

The output, also on my server:

Status for the jail: apache-mycustombots
|- Filter
|  |- Currently failed:	0
|  |- Total failed:	1214
|  `- File list:	/var/log/ispconfig/httpd/xxx/access.log /var/log/ispconfig/httpd/yyy/access.log /var/log/ispconfig/httpd/zzz/access.log ...
`- Actions
   |- Currently banned:	15505
   |- Total banned:	16690
   `- Banned IP list:	1.180.164.122 1.180.164.125 1.180.164.128 ...

Here you can see how many hits have been registered from addresses that are not yet banned (1214); if another Apache request comes from them, they will be banned too. Currently 15,505 IP addresses are banned, and below that is the total number banned so far; the difference is made up of bans older than 24 hours that have already been lifted.
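
If a legitimate address ever gets caught by the filter, it can be released by hand with fail2ban-client; the IP address below is only an example:

# Remove a single IP address from the jail's ban list
fail2ban-client set apache-mycustombots unbanip 1.2.3.4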

Advantages and disadvantages

Advantages:

  • It provides a global setting, so it does not have to be configured separately for each website.
  • The blocking happens in the firewall, so the incoming request is never passed on to Apache.
  • Depending on the ban action, the firewall can simply drop the connection without responding. That way, intruders and robots may think the server is down and move on over time...
  • During the ban, the blocked IP addresses do not reach Apache, so they do not grow the log file with further entries, which in turn reduces the load on Fail2Ban.

Disadvantages:

  • With many thousands or tens of thousands of IP addresses, the constant evaluation of the generated firewall rules (in the firewall itself, not in Fail2Ban) can put a strain on the CPU.
  • It is more complicated to set up and requires root access, so webmasters of individual websites cannot do it themselves.

 

Blocking robots using Apache .htaccess files

I think everyone knows Apache .htaccess files, so they need no introduction. The setup here is extremely simple: open or create the .htaccess file in the document root of the website you want to protect, and add the following section, which again contains our robot filter line from before:

RewriteEngine on

# Block unnecessary robots.
RewriteCond %{HTTP_USER_AGENT} ^.*(SemrushBot|AhrefsBot|Mb2345Browser|MegaIndex\.ru|MJ12bot|DotBot|Baiduspider|YandexBot|LieBaoFast|zh_CN|zh-CN|SeznamBot|trendictionbot|magpie-crawler).*$ [NC]
RewriteRule .* - [F,L]

Then save the file and the filter is already active. The one thing to watch out for: if the .htaccess file already contains the RewriteEngine on directive, insert these lines below it.

In these cases Apache rejects the request with a Forbidden response (HTTP status code 403), although a 500 error may also occur.
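
As a side note, mod_rewrite can also answer with a 410 Gone status instead of 403, which some prefer for unwanted robots; a possible variant where only the RewriteRule line changes:

# Variant: reply with "410 Gone" instead of "403 Forbidden"
RewriteRule .* - [G,L]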

Let's also look for the blocked entries in the log file, which the system recorded with status code 403 or 500. Such requests can be filtered out of the log file with the grep or egrep commands below (whichever you prefer):

cat access.log | grep " 403 \| 500 "
cat access.log | grep -E " 403 | 500 "
cat access.log | egrep " 403 | 500 "

If in the output we only find 403 codes for our banned robots, then everything is fine. A few examples:

114.119.146.18 - - [21/Sep/2022:23:38:13 +0200] "GET /enciklopedia/g HTTP/1.1" 403 6774 "https://www.linuxportal.info/index.php/enciklopedia/m/matomo-piwik-webanalitika?amp" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
54.38.96.32 - - [21/Sep/2022:23:50:12 +0200] "GET /robots.txt HTTP/1.1" 403 6767 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
216.244.66.248 - - [21/Sep/2022:23:50:13 +0200] "GET /robots.txt HTTP/1.1" 403 6728 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
5.196.175.154 - - [21/Sep/2022:23:56:11 +0200] "GET /taxonomia/samba HTTP/1.1" 403 6774 "https://en.linuxportal.info/tutorials/filesystem/how-to-share-directories-on-linux-and-windows-systems-page-2" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

If, on the other hand, we see 500 error codes for the robots, for example:

216.244.66.248 - - [20/Sep/2022:01:26:03 +0200] "GET /robots.txt HTTP/1.1" 500 5117 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
87.250.224.29 - - [20/Sep/2022:04:30:59 +0200] "GET /robots.txt HTTP/1.1" 500 5163 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
116.179.37.84 - - [20/Sep/2022:16:39:50 +0200] "GET /sites/default/files/theme/linux-logo_kicsi.png HTTP/1.1" 500 5674 "https://en.linuxportal.info/" "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)"

Then there is still a small blemish left. It is not a real problem, but if we want to be precise, the tutorial below shows how to fix it:

 

Advantages and disadvantages

Advantages:

  • It does not create a separate firewall rule for each IP address, so a larger number of addresses does not overload the firewall.
  • It is easy to set up and does not require root access, so webmasters of individual websites can do it for their own pages.

Disadvantages:

  • There is no global setting; it has to be done separately for each website. A global configuration is only possible in Apache itself, which requires root access.
  • The blocking is carried out by Apache, so the load falls on Apache as well. Apache processes are also tied up, even if only briefly.
  • Since Apache returns a "Forbidden" response, attackers and robots know that the server is up and running, so they may keep trying.

 

 

Analysis

Since I set up the two methods above yesterday, I have gathered some useful results, so let me share them.

I have the Munin server resource monitoring tool installed on my server; it will be covered in a separate tutorial, here I only use its graphs.

Immediately after the launch of Fail2Ban yesterday, the number of blocked IP addresses started to increase:

Munin - Fail2Ban graph 1.

Here you can see that the addresses caught by our apache-mycustombots filter started to climb. And a little later, the CPU graph:

Munin - CPU Graph 1.

There was a big CPU spike here between 6 in the morning and noon. The robot traffic then subsided on its own, and it was after that that I set up the filter.

Shortly afterwards I also added the filters to the .htaccess files, so the two have a combined effect. Today's Fail2Ban graph looks quite interesting:

Munin - Fail2Ban graph 2.

What happened here is that the number of blocked IP addresses reached a level of around 16 thousand and then began to flatten out. I put this down to the fact that the operators of many of the larger robots have address ranges of roughly this size, or, if they own a larger pool, this is the capacity they send towards this region. So a good share of the robots may simply have started running out of free addresses, since the rest were still banned at that point.

Incidentally, if we look up the robots' IP addresses with the whois command, the provider's data will include the IP address range or the CIDR value, from which, for example on this page, we can calculate the number of IP addresses in the range. All of this is only useful for estimating the capacity of our favourite robots.
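
A quick sketch of such a lookup; the IP address is one of the DotBot entries from the logs above, and the exact field names vary by registry:

# Show the address range / CIDR that the robot's IP belongs to
whois 216.244.66.248 | grep -iE 'inetnum|netrange|cidr|route'
# A /24 prefix, for example, covers 2^(32-24) = 256 addresses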

A little later you can see a sudden drop: that was when I restarted Fail2Ban today, because I had added a few more robots to the filter. The recovery afterwards looks a bit slower, as it took Fail2Ban some time to load the filter rules back into nftables. Then it returned to its previous level, and when it reached the 24-hour mark it began to decline slowly. That is the bantime I set for the jail, so the expired bans are being lifted. This nicely closes the cycle.

During this time, the CPU was also interesting:

Munin - CPU Graph 2.

Applying yesterday's Fail2Ban filter and the .htaccess files together paid off: everything has calmed down. There was a small spike afterwards that was due to something else, and then you can see the milder bump caused by restarting Fail2Ban. So with this many IP addresses, even a restart produces a small load; as the previous graph showed, it took time to load the addresses back into the firewall rules. After that comes a lower but more lasting rise in CPU load, which is still ongoing.

 

 

Conclusion

In summary, when setting up these filters it is worth keeping in mind the amount and composition of the robot traffic, the number of websites and their traffic, and the capacity of our server. Banning a large number of IP addresses can itself increase the CPU load, especially on a weaker server with fewer CPU cores and already high load values.

Of course, there is also an efficient way to handle blocking large numbers of IP addresses, using ipset, and it can even be combined with our Fail2Ban system. We will get acquainted with all of that in a later tutorial. In the meantime, let's keep an eye on our server resources and make optimal use of both filtering methods.