How to defend against attacks resulting in large volumes of 404 or other 4xx HTTP error codes with Fail2Ban

Published by botond on 2023/03/14 (Tuesday) at 00:38


Introduction

Our websites are constantly under attack from the outside world. The vast majority of these attacks are carried out by robots that try to discover the weak points of the websites running on our server. Some of these robots do this by sending various, seemingly random HTTP requests to our websites, most of which point to non-existent URLs, so our server responds with a 404 HTTP status code. By itself this would not be a problem; however, such attack attempts usually arrive in short bursts, during which a very large number of requests hit our server, resulting in a large number of 404 (or sometimes other 4xx) response codes. Such a flood of requests can cause load spikes, which can lead to an unwanted drop in server performance.

I have already prepared a similar description in which we banned robots using Fail2Ban and Apache's .htaccess files; in this description, we will see how Fail2Ban can also help us ban attempts that arrive in large volumes and result in 404 or other 4xx HTTP response codes.

The following description contains relatively long theoretical parts compared to the task itself. Those who are in a hurry, or who are advanced and do not want to read through the theory, can skip to the advanced part: a short summary at the bottom of the full description that contains only the commands and settings necessary to complete the task. However, if those parts are not self-explanatory, or the steps do not work for some reason, it is recommended to read the description from the beginning. Jump to advanced!

 

 

How they attack our server

In such cases, the attacking party usually tries to map the perceived weak points of our websites. It does this by probing known file and directory patterns and trying to determine the type of our system (for example, which CMS it runs), in order to assess its security deficiencies. This activity is usually a minor part of a more comprehensive analysis that typically also includes DNS inspections or even port scanning to check which of our ports are open. While the first activities are harder to catch, the resulting high volumes of 404 and other 4xx HTTP error codes are the most telling signs of this type of attack attempt, and we can easily filter and block them.

In the picture below we can see a detail of yesterday's Apache log, filtered for these requests resulting in 404 and other 4xx HTTP response codes:

Large amount of 404 error codes in the Apache log

As you can see, these requests arrive in a very large quantity within a very short time. Although the beginning of the series is not visible in the picture, they generated a total of more than 6,000 Apache requests in the course of about 5-10 minutes yesterday. If you take a closer look, the vast majority of these are 404 errors:

Large amount of 404 error codes in the Apache log

If we use some kind of system resource monitoring software, such as Munin, we can clearly see the spike in the middle of the Apache graph standing out from its surroundings:

In the center of the Apache graph is the load spike

According to the graph, the number of requests in this 5-minute cycle reached 20 requests per second. Projected onto, say, an hour, that could be more than 72,000 requests. Luckily, I noticed it in time and, as a temporary solution, banned the IP address in the UFW firewall. Who knows how long it would have gone on if I had not banned it.

Fortunately, requests resulting in such 404 HTTP error codes do not typically stir up much dust around the CPU load: as we can see on the CPU graph, there are no spikes in that time band. What does peak here every 12 hours is the load generated by the internal caching mechanism of the WordPress pages (WP Super Cache plugin):

There is no increase in the CPU graph during this period

In the case of 404 errors, by default neither PHP nor a database query runs, so the load is low: Apache simply returns the request with a 404 response code. However, if we use custom error pages that, for example, point or redirect to an error page of our website, then depending on how the page works, PHP and the database can also get involved, for example if it is an internal error page of a CMS system. Therefore, it is definitely worth considering how much of a burden, and thus a danger, these seemingly harmless requests can pose under our own circumstances. Later, depending on this, we will have to determine the severity of the ban, but more on that later.

Temporary solution

As a temporary solution, a simple UFW ban command will do for quick situations:

sudo ufw insert 1 deny from 34.64.0.0/10

Of course, this is not the proper solution; it only temporarily stopped the downloads at around 6,300 requests. As a permanent solution, we will create a Fail2Ban jail that handles these attempts automatically. After that, we can remove our manual ban, because normal visits can also come from such a large IP range (although in the case of a hosting provider's IP addresses there is little chance of this, since robots are usually launched from such places).
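Once the permanent Fail2Ban jail described below is in place, the manual rule can be removed again. A sketch (check the rule number first, as it may have shifted since the rule was inserted):

```shell
# List the current rules with their numbers
sudo ufw status numbered
# Delete the temporary ban either by its number...
sudo ufw delete 1
# ...or by repeating the original rule verbatim
sudo ufw delete deny from 34.64.0.0/10
```

Deleting by the repeated rule text is the safer option, since rule numbers change whenever rules are added or removed.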

 

The interesting thing is that the browser User-Agent in the requests shown above is "cyberscan.io", which, if you visit it, turns out to be an analysis site dealing with security matters, so it is likely that someone entered the name of the website there before it launched its robots. And if we run a whois on the IP address, it turns out to be part of the Google Cloud network, so they rent the IP addresses from there. In this particular case, it was a harmless scan launched from such a "security check" page. However, these should not be taken lightly, because much more aggressive attempts can arrive at any time, which can put a serious load on our server. To prevent this, let's see what we can do for our protection.

 

 

Preventing the attack with Fail2Ban

For those who are not yet familiar with the Fail2Ban program, it is worth knowing briefly that it is perhaps the most effective protection tool used in the operation of web servers. Its operating principle is that it counts hits by looking for patterns, defined by regular expressions, in various log files, and then blocks the IP addresses in question with firewall rules when the specified limits are reached. After the configured time has passed, it unblocks the banned clients. Thus the "circulation" is continuous: no permanently blocked IP addresses remain, and freshly detected suspicious clients are banned immediately. Its greatness lies precisely in its simplicity. If it is not yet on our server, let's start with its installation.

If we are using, for example, an ISPConfig server environment, then this protection tool is already installed on our server; in this case, skip the installation and the log rotation and continue with the configuration.

Install Fail2Ban

To install Fail2Ban, run the command below as root:

apt-get install fail2ban

Rotate log files

Fail2Ban produces large log files over time, so it is highly recommended to rotate them as well. We will not go into this here, because it is not the topic of this description, and we have already covered it in several other descriptions, which can be viewed at the following links:

Fail2Ban setup

There are two things you need to do to set up Fail2Ban:

  • first, create a filter file in the /etc/fail2ban/filter.d/ directory
  • then configure the jail in the /etc/fail2ban/jail.local file, which applies the filter to the corresponding log file and runs the jail

Before we start with these settings, it is important to note that the Apache log files to be monitored exist in several places and in several formats, so Fail2Ban can be configured in several ways. Here we will look at two types of settings, of which we can use the one that appeals to us more and is, of course, available on our system. I present the sections containing these settings in two parts, so that the process and the reasoning behind the settings are clearer and easier to understand.

Before presenting each setting variant, let's take a look at the image below, where 4 terminals are visible at the same time, so the differences between the log files can be seen more clearly:

Apache log file types

I made this screenshot on my other VPS, where the first three terminals show different Apache log files and the fourth shows Fail2Ban's log.

Don't be fooled by the occasionally incorrect IP address highlighting here. The color highlighting of IP addresses is done by MobaXterm (the SSH client and terminal I use), which sometimes recognizes other data as IP addresses; here, for example, the browser identifiers (User-Agent) contain version numbers similar to IP addresses.

Here, I prepared the settings in advance and tested them from a mobile phone over mobile internet, so that you can see how things work and to make the settings that follow easier to understand.

From the mobile phone, I tested the generation of 404 errors on two virtual hosts:

  • at the address available under the full host name (vps.linuxportal.eu), from which the various web applications can be accessed, e.g. ISPConfig, webmail, phpMyAdmin, etc.
  • as well as on the (currently) demo hosting on this VPS (linuxportal.eu), where a plain web hosting is available.

It is important to note that these attacks can arrive at any virtual host, so not only our normal websites can be attacked, but also the web applications available under the full hostname (FQDN). Accordingly, we can expect such mass 404 errors on any virtual host, which is why I tested both locations currently available on the VPS:

Test for 404 error under fully qualified hostname (FQDN).

Under the full hostname, on port 8081 (port 8081 will be important later), I tried to load /aaaaaaaa, which obviously resulted in an error. In the 4-part terminal image above, the upper left terminal shows this by checking the /var/log/ispconfig/httpd/vps.linuxportal.eu/access.log file. And the other:

Testing the 404 error under normal web hosting

an incorrect attempt under the normal web hosting, where I tried to access the path /bbbbbbbbbb, of course with an error here as well. Here, however, the custom error page of the ISPConfig control panel comes into play, which is set by default on the created web hosts. The result can be seen in the upper right terminal of the split screen, by viewing the /var/log/ispconfig/httpd/linuxportal.eu/access.log file.

And in the Fail2Ban log, you can see how it banned the phone's IP address after a few attempts, after which nothing loaded anymore.
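Such a test can also be reproduced from the command line instead of a phone; a sketch using curl (the host names are the article's own examples — run this from a network you do not mind seeing temporarily banned):

```shell
# Request a non-existent path and print only the HTTP status code
curl -sk -o /dev/null -w "%{http_code}\n" https://vps.linuxportal.eu:8081/aaaaaaaa
curl -sk -o /dev/null -w "%{http_code}\n" https://linuxportal.eu/bbbbbbbbbb
# Repeating this maxretry times within findtime should trigger the ban
```

After the ban takes effect, further curl attempts from the same address should simply time out (with the DROP blocktype used later in the jail).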

Knowing all this, let's now look at the structure of the various log files; from here, the Fail2Ban setup branches into two options.

Search for IP addresses at the beginning of lines

In the case of the first variation, we use those Apache log files whose lines start with IP addresses, so we have to look for them in the first column. In the ISPConfig server environment, such log files can be found, as already shown above, in further subdirectories under the /var/log/ispconfig/httpd/ directory:

Examining log files

(In the case of web hostings, the same log files are also available via bind mounts in the directory structure of the web hosts, as set up by ISPConfig, but we will not deal with this right now.)

So in the picture you can see the following:

  • the directories under /var/log/ispconfig/httpd/ containing the log files of the virtual hosts (the web hosting(s) and the virtual host associated with the full hostname)
  • the access.log files in those directories
  • as well as a brief look at the access.log file in both directories, filtered for 404 errors, where we can see the error tests made with the mobile phone above, each in the access.log file of its own directory.

Due to the live server, an erroneous request from elsewhere also slipped in during the tests, directed at a non-existent /info.php. Presumably this did not come with pure intentions either: during installations or development, many people forget to remove their phpinfo() files that generate PHP output, which robots like to search for during such attempts. Furthermore, the use of the HTTP/1.1 protocol indicates the robotic nature of the visit, as all normal web browsers already support HTTP/2, so regular visits all use the newer protocol.

All this clearly illustrates that there are continuous attempts even on this small, "hidden" (only used for writing articles) VPS.

Filter file setting

So, if we want to set up our filter using the log files under the /var/log/ispconfig/httpd/ structure, let's create a file in the /etc/fail2ban/filter.d/ directory, which I name apache-4xx.conf:

nano /etc/fail2ban/filter.d/apache-4xx.conf

And give it the following content.

[Definition]
# Search for IP addresses at the beginning of the lines:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$

# Exceptions:
ignoreregex = .*(robots\.txt|favicon\.ico)

And save it.

Short explanation:

  • The ^ at the beginning of the failregex expression indicates that the log lines start with the IP addresses; this is the pattern Fail2Ban uses when deciding whether a line counts towards a ban.
  • In the middle part, we match GET, POST and HEAD requests alike
  • Finally, we match the occurrence of any of the listed error codes
  • And in the exceptions, we can specify some files that do not have to exist but that, for example, search engine robots sometimes look for. To avoid banning on such errors, we include them in the exceptions.
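Before restarting Fail2Ban, the pattern logic can be sanity-checked even without it. A minimal sketch with grep -E, in which Fail2Ban's <HOST> tag is stood in for by a plain IPv4 pattern and the log line is made-up sample data:

```shell
# Made-up sample log line in the "IP at line start" format
line='203.0.113.7 - - [14/Mar/2023:00:38:00 +0100] "GET /aaaaaaaa HTTP/2.0" 404 4211 "-" "Mozilla/5.0"'
# Same structure as the failregex, with <HOST> replaced by an IPv4 pattern
echo "$line" | grep -qE '^[0-9]{1,3}(\.[0-9]{1,3}){3} -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$' \
  && echo "404 line matches"
```

On the server itself, the authoritative check is fail2ban-regex, e.g. `fail2ban-regex /var/log/ispconfig/httpd/vps.linuxportal.eu/access.log /etc/fail2ban/filter.d/apache-4xx.conf`, which reports how many log lines the filter matched.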

 

 

Jail setting

If our filter file is ready, then the jail must be configured. To do this, open the /etc/fail2ban/jail.local file:

nano /etc/fail2ban/jail.local 

If this file does not exist yet, create it; if it already exists and has content, add the following to the end:

[apache-4xx]
enabled = true
port = http,https,8080,8081
filter = apache-4xx
logpath  = /var/log/ispconfig/httpd/*/access.log
findtime = 60
maxretry = 5
banaction = %(known/banaction)s[blocktype=DROP]
bantime = 3600
ignoreip =

Explanation:

  • [apache-4xx]: This is the name of our jail. It can be anything, but it must not conflict with the names of other jails, and it is advisable for it to match the name of our filter, which makes maintenance easier later. Since I gave my filter this name, I use it here as well; this is not mandatory, but it makes sense.
  • enabled: Boolean switch: whether the jail is enabled.
  • port: Here the monitored ports must be entered, separated by commas. In addition to the usual "http" and "https", we also specify ports 8080 and 8081, thereby protecting our ISPConfig control panel on port 8080 and the web applications on port 8081 (phpMyAdmin, webmail, Munin, etc.). Of course, if we use custom ports instead of these, we specify those here.
  • filter: Here we must enter the exact name of our filter file without the .conf extension. I use the name "apache-4xx", which Fail2Ban will look for as "apache-4xx.conf" in the /etc/fail2ban/filter.d/ directory. An exact match is required here.
  • logpath: The path of the log file(s) to monitor. The "*" wildcard character can also be used in it. The setting above exactly matches the Apache log file structure discussed here, so the access.log files of all virtual hosts are examined. We can see how this works in the lower right terminal of the previous split-screen image, where the two "Added logfile" lines show the log files in use.
  • findtime: The "time frame" monitored for hits, specified in seconds. The maxretry value (number of attempts) set below must be reached within this time frame. If not specified, the default is 10 minutes (10m); this default is defined in the /etc/fail2ban/jail.conf file.
  • maxretry: The number of attempts that may occur within a findtime window. If this number is reached, the IP address is banned. The default value is 5.
  • banaction: The action performed when the ban takes effect. It does not have to be specified; if omitted, Fail2Ban by default places a REJECT-type blocking rule in the iptables firewall. I, however, prefer DROP, so I set it here. I wrote about their differences earlier in connection with the UFW firewall (the DENY policy used by UFW corresponds to the DROP policy of iptables). So it is optional, but I prefer to drop the connection altogether rather than fumble with various replies.
  • bantime: The duration of the ban, given in seconds. If not specified, by default this is also 10 minutes ("10m").
  • ignoreip: This will be needed for whitelisting; the IP addresses listed here are never blocked by Fail2Ban. Since this is a very important part of the entire setup, we will discuss it in a separate chapter after the settings; until then, we leave it empty.

What is important about this setup is that the log files are added from under the /var/log/ispconfig/httpd/ structure, where a separate log file exists for each virtual host. These log files place the IP addresses at the beginning of the lines, as opposed to the next variation; this is the point on which the jail setup process was divided in two.

It is also important to note that these log files only exist in server environments with ISPConfig, since ISPConfig manages them, so they are not present on, for example, a plain LAMP or other server.
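After saving jail.local, Fail2Ban has to be reloaded for the new jail to start. A quick way to do this and to verify the result (a sketch; the jail name matches the configuration above):

```shell
# Reload the configuration and start the new jail
sudo fail2ban-client reload
# Check the jail: shows the monitored log files, hit counts and banned IPs
sudo fail2ban-client status apache-4xx
```

The status output should list the access.log files picked up by the logpath wildcard; if a file is missing there, the path pattern needs to be revisited.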

When using ISPConfig, we can choose this setting so that all of our virtual hosts are protected. In the split-screen image above, this filtering method monitors the log files displayed in the two upper terminals. Of course, if there are several web hostings on the server, all of them are handled the same way.

I will write about further fine-tuning of the jail after the other variation, since those settings apply to both.

And now let's look at the other setting variation.

Search for IP addresses inside lines

This setting is very similar to the previous one; it differs only in the following: instead of the separate per-virtualhost Apache log files of the previous variation, here we use the other_vhosts_access.log file in the /var/log/apache2/ directory, which is also Fail2Ban's default Apache log file. This is an aggregated log file which, in contrast to the previous ones, collects the Apache data of all virtual hosts in one log. Using this file is especially convenient when, for example, we want to follow in one terminal all the activity on the websites running on the server, as well as the activity on the web applications available under the full hostname. This file therefore includes the data of all virtual hosts. An example can be seen in the lower left terminal of the split terminal image above, where the tests made with the phone all appear in the same file.

But here I am showing a more recent picture of this, where I have also filtered for 404 errors:

The other_vhosts_access.log file

It looks better on a bigger screen.

Here again we can see "foreign" requests at the beginning, which arrive with the already mentioned HTTP/1.1 and apparently send random requests. There is also a "/favicon.ico" request, for example, which was not added to the exceptions of the filter above by accident, because these requests can mostly be linked to search robots looking for the icon file of the website with harmless intentions. More recently, for example, the Google search engine also displays these icons in its list of results. Therefore, it is advisable to include them in the "ignoreregex" part of the filter. The other foreign requests, however, came with at least dubious intentions.

The "/aaaaaaaa" and "/bbbbbbbbbb" requests shown in the lower parts are again mine, made from my mobile phone over mobile internet. Here, HTTP/2 also shows that these faulty requests were made from a regular browser.

So the point here is that everything appears in one file, which is why this is my favorite Apache log file; it is constantly "running" on one of my monitors with the tail -f command. That way, I can always see out of the corner of my eye when something abnormal is happening on the server while I work. Of course, if one goes out to make coffee and a large 404 tsunami arrives unseen, that is exactly the case we are preparing our Fail2Ban solution for. :)

The interesting thing about this log file, however, is precisely that the data of several virtual hosts is placed in one file. Thus, instead of the IP addresses at the beginning of the lines as in the previously described log files, this log file starts each line with our own host names, indicating on which virtual host (hosting or application) the particular request was made. Here, the domain names of the hostings on our server can appear in the first column, as well as the FQDN of the server if we open, for example, phpMyAdmin or webmail through it.

As a result, the IP addresses of the clients have moved to the "second column" compared to the previous log, so our previous regular expression used in the filter would not find the IP addresses, but would constantly try to block our own hostnames, since those are now at the beginning of the lines (thanks to the "^" part).

So here we have to make a small modification to our filter and then to the jail's logpath setting. Let's see these!

Filter file setting

In this case too, we create our filter file in the /etc/fail2ban/filter.d/ directory, which I again name apache-4xx.conf:

nano /etc/fail2ban/filter.d/apache-4xx.conf

And give it the following content.

[Definition]
# Search for IP addresses inside the lines:
failregex = .* <HOST> -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$

# Exceptions:
ignoreregex = .*(robots\.txt|favicon\.ico)

And save it.

Here, the difference compared to the first variation is that the regular expression in failregex does not look for the IP addresses at the beginning of the lines (^); instead, any contiguous part (.*) may precede them, followed by a space and then the IP address. So it will find the clients' IP addresses after the string containing our own hostnames and port numbers at the beginning of the line. Apart from that, there are no other differences compared to the previous filter file.
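This variant can be sanity-checked offline the same way with grep -E; the vhost:port prefix and the whole sample line below are made-up test data, and <HOST> is again stood in for by a plain IPv4 pattern:

```shell
# Made-up sample line in other_vhosts_access.log format (vhost:port first)
line='vps.linuxportal.eu:8081 203.0.113.7 - - [14/Mar/2023:00:39:00 +0100] "GET /bbbbbbbbbb HTTP/1.1" 404 492 "-" "cyberscan.io"'
# Same structure as this variant's failregex: anything, space, then the IP
echo "$line" | grep -qE '.* [0-9]{1,3}(\.[0-9]{1,3}){3} -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$' \
  && echo "404 line matches"
```

On the server, `fail2ban-regex /var/log/apache2/other_vhosts_access.log /etc/fail2ban/filter.d/apache-4xx.conf` again gives the authoritative match counts for this format.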

Jail setting

If we have saved the filter file, the jail must be configured here as well. For this, open the /etc/fail2ban/jail.local file for editing:

nano /etc/fail2ban/jail.local 

If this file does not exist yet, create it; if it already exists and has content, add the following to the end; and if we tried the first variation above, overwrite that with the following content:

[apache-4xx]
enabled = true
port = http,https,8080,8081
filter = apache-4xx
logpath  = %(apache_access_log)s
findtime = 60
maxretry = 5
banaction = %(known/banaction)s[blocktype=DROP]
bantime = 3600
ignoreip =

Here almost everything is the same as in the first variation; only the logpath setting differs. So here we do not monitor the /var/log/ispconfig/httpd/*/access.log files with our jail, but specify an internal Fail2Ban variable that contains the default log file(s), which in this case means the following two files:

  • /var/log/apache2/other_vhosts_access.log
  • /var/log/apache2/access.log

The images below show how this setting works:

Fail2Ban - Another setting

Going through the small terminals in order, we can see the following:

  • Top left: The apache-4xx.conf file, where I switch between the active settings using comments, so the state shown here corresponds to the filter setting just described, which does not look for IP addresses at the beginning of the line.
  • Top right: The jail.local configuration file, where I also switch between the settings with comments. Here, too, the "%(apache_access_log)s" logpath just described is the active setting.
  • Bottom left: The content of the /var/log/apache2/access.log file, just out of curiosity, as it is also covered by the logpath setting. This file, however, does not contain client IP addresses, so it is not relevant for our jail. Its format does not interfere with the operation of our filter either: the "127.0.0.1" addresses at the beginning of its lines are skipped, since this filter setting looks for IP addresses in the "second column". As we do not extract IP addresses from it, this log file is currently useless for us, but it does not disturb our filter, so we will not deal with it further.
  • Bottom right: And here you can see the Fail2Ban restart.

The following split screenshot shows the operation after the restart:

Fail2Ban - Another setting in action

And here we can see the following:

  • Top left: The /var/log/ispconfig/httpd/vps.linuxportal.eu/access.log file, which we are not watching with this jail setting, but data is still collected in it. My own tests appear here again.
  • Top right: The /var/log/ispconfig/httpd/linuxportal.eu/access.log file, likewise not used in our current setup, but data is collected in it in the same way. Here, too, are the tests with the incorrect requests arriving at the hosting.
  • Bottom left: The /var/log/apache2/other_vhosts_access.log file, which is the one monitored in this setting; this is where things are "happening". Here you can see the incorrect requests arriving at both virtual hosts. With multiple hostings, all relevant data would of course appear here in the same way.
  • Bottom right: The /var/log/fail2ban.log file with the freshly started jail, where we can see the two log files in use (already listed above for this setting), as well as the results of my mobile phone tests, as it banned the phone's IP address after the 5th attempt.

Here you can see how this setting works.

 

 

Unbanning IP addresses

If we ban our own IP address during experiments and tests (which is precisely our goal here: to verify that the Fail2Ban filters and jails work), then we can simply lift the ban with the following command:

fail2ban-client unban <ip-address>

This command removes the ban from every jail, so during our tests it is the fastest way to unban. This can also be seen in the picture below, taken after a restart, where Fail2Ban restored (re-registered) my banned IP address, which I then unbanned:

Fail2Ban - Unlock IP Address
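If we only want to lift a ban in a single jail instead of all of them, the per-jail form can be used (a sketch with the jail name configured above and a placeholder IP):

```shell
# Unban an address in one specific jail only
sudo fail2ban-client set apache-4xx unbanip 203.0.113.7
# List what is currently banned in that jail
sudo fail2ban-client status apache-4xx
```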

I have already written a more detailed description about unbanning addresses blocked by Fail2Ban:

 

Jail tweaking

As promised above, here are a few words about fine-tuning our jail. I do not want to repeat the jail parameters here; I would just summarize in a few sentences what to keep in mind when fine-tuning these parameters before live use:

The findtime (time frame), maxretry (number of attempts) and bantime (ban duration) values should be set in relation to each other: first check the traffic of our server and of our websites' visitors, and adjust these settings accordingly. Of course, a completely in-depth analysis is not necessary here, especially if we know the traffic of the websites on our server; but if we do not, we can simply monitor it for a short time with the following command:

tail -f /var/log/apache2/other_vhosts_access.log | grep " 404 "
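Beyond watching the log live, a rough per-IP tally of 404s also helps in choosing findtime and maxretry. A self-contained sketch (the log sample below is made up; on the server, point the pipeline at the real other_vhosts_access.log instead):

```shell
# Build a small sample log in other_vhosts_access.log format (made-up data)
cat <<'EOF' > /tmp/sample_access.log
vps.example.eu:8081 203.0.113.7 - - [14/Mar/2023:00:38:01 +0100] "GET /aaaaaaaa HTTP/1.1" 404 492 "-" "-"
vps.example.eu:8081 203.0.113.7 - - [14/Mar/2023:00:38:02 +0100] "GET /info.php HTTP/1.1" 404 492 "-" "-"
www.example.eu:443 198.51.100.9 - - [14/Mar/2023:00:38:03 +0100] "GET / HTTP/2.0" 200 5310 "-" "-"
www.example.eu:443 203.0.113.7 - - [14/Mar/2023:00:38:04 +0100] "GET /x HTTP/2.0" 404 492 "-" "-"
EOF
# The client IP is the 2nd column in this format; count 404s per IP
grep ' 404 ' /tmp/sample_access.log | awk '{print $2}' | sort | uniq -c | sort -rn
```

An IP that accumulates many 404s in a short span is a ban candidate, while the long tail of one-off 404s shows how much slack normal visitors and good robots need.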

This will show how strict the settings need to be. But if we do not want to sit and watch the log files, we should simply rely on logic and common sense. Some aspects:

  • Note that there will be visitors who arrive at a wrong URL, resulting in a 404 error. They may follow an incorrectly shared link, or even type the path into the address bar themselves (yes, there are people who reach into the address bar and rewrite things).
  • Note that our websites are also indexed by search robots (the good robots, e.g. Google, Bing, Yahoo, etc.), and these robots quite often request wrong addresses, which also result in 404 errors.
    • For example, in the case of dynamic websites, if search engines have indexed our pages and we later delete content that is no longer needed, or comments, etc., then robots and visitors will still "look for" these for a while.
  • If we use some kind of cloud-based network that retrieves our content acting as a proxy, many visitors will arrive from its IP addresses; in this case several users can come from the same IP address, which increases the number of requests per IP address, and thus the number of 404 errors as well.
    For example, the English translation of my site is done by an external system through a proxy, so a lot of visitors come from its IP addresses.
  • Whitelisting. There are networks that we do not want to ban under any circumstances, for example search engines and other partners. We will discuss this in more detail below.

So keep things like this in mind when setting these parameters.

It is advisable to start with a more relaxed setting and check our Fail2Ban jail from time to time; or, if we have Munin for example, we can monitor the jail's activity there as well, and tighten the settings accordingly.

There is not much more to say about this: let's decide based on common sense and set these parameters according to the traffic of our own server. What is also important to mention here is whitelisting. We already touched on it in the explanation of the jail, but since it is a very important part, we review it in a separate chapter.

 

Whitelisting - or those we never want to ban

Before we harden our jail, we must review a very important part: whitelisting. This is at least as important as setting up the ban itself, which is why I have highlighted it in this separate chapter. If we do not prepare it properly, even in the medium term we can cause more damage to our websites with bans than if we had done nothing at all. Therefore, it is important to define precisely the groups that we never want to ban from our websites under any circumstances. Let's collect these:

  • Search robots. Typically, they are robots from well-known search engines. Everyone should know this from their own traffic, which are the robots that deliver the most traffic to our websites. For me, for example, the distribution of traffic mediated by robots in order (analyzing a 1-year time frame): 
    Google (91,7%), Yandex (5,1%), DuckDuckGo (1,2%), StartPage (0,6%), Brave (0,6%), Bing, Baidu, Facebook, Qwant
    Here, of course - in search engines under Google - the distribution of robots strongly depends on the language, topic, etc. of our website. Google usually keeps such top lists fixed everywhere, but the ones that come after them may differ from page to page.
  • Robots of social sites. If we expect traffic from different social media sites, we must obtain their IP addresses as well, because they often use their own robots, which check, for example, the images and meta tags available at the source URL addresses of the shared content, so they can also visit our pages frequently. . Even if we are not big fans of social media ourselves, our visitors can still share our content they like on social media.
  • Cloud-based services that forward traffic to us acting as a proxy, e.g. CloudFlare, etc. These providers usually publish a list of their IP addresses. I do not use such a service myself, but anyone who does should check the list with the provider.
  • Other partners, who act as a proxy and forward the traffic to us. In my case, for example, the English translation of en.linuxportal.info is done by such a company, so those who come to the English-language pages come to their servers, and then the requests are made from their servers. Thus, by default, the IP addresses of the translation company appear in the Apache logs. This company provides me with the IP addresses they use, so they can be easily whitelisted.
  • API-based services. This could be, for example, PayPal or another payment gateway, which must be able to communicate back and forth with our server, typically over the HTTP/HTTPS ports. If these are set up properly and working, they do not usually generate 404 errors, but if an error slips in and a request "goes sideways", they too can get banned. So if we use such services, let's take them into account as well.
  • Our own fixed IP address. If our home Internet service uses a fixed IP address (because we requested one from the provider, etc.), it is recommended to whitelist it as well, so that we do not accidentally ban ourselves.
  • Others. All remaining cases that are not based on public services but where, for example, the known IP addresses of another company must make requests to our websites. We have to collect these addresses ourselves as well.

All of this may seem like a lot at once, but I listed it because I would not want someone to set up the Fail2Ban ban described here and then, as a result of a bad setting, ban half of the internet from their websites. So let's keep these groups in mind when using any blocking mechanism, whether the one from here or a solution prepared from a description found elsewhere.

White list setting

The whitelist, i.e. the exceptions, can be set with the "ignoreip" parameter of the jail settings already presented above. Example:

ignoreip = 192.168.1.1 192.168.2.0/24 10.0.0.1

Here we can enter several IP addresses or CIDR (Classless Inter-Domain Routing) address ranges. If we want to specify more than one item, we separate them with spaces. If we have many items and want to keep the list readable, we can break it into several lines: end the line with a "\" character (preceded by a space) and indent the continuation line. Example:

ignoreip = 192.168.1.1 192.168.2.0/24 \
           10.0.0.1

Set the required IP addresses accordingly.

 

 

Processing of IP addresses stored in JSON format

Many large service providers publish, in JSON data structures, the IP addresses from which their services access the outside world; Google's googlebot list, used below, is one such example. If you are interested in the IP addresses of the robot groups listed at the beginning of the chapter, you can find them at the respective service providers.

Here, for the sake of example, we will look at the list of Google robots and how we can process it automatically to get a list separated by spaces.

If we click on the link of the list containing the robots, we will get a content like this:

{
    "creationTime": "2023-03-07T23:03:55.000000",
    "prefixes": [
        {
            "ipv6Prefix": "2001:4860:4801:10::/64"
        },
        {
            "ipv6Prefix": "2001:4860:4801:11::/64"
        },
[...]
        {
            "ipv4Prefix": "34.100.182.96/28"
        },
        {
            "ipv4Prefix": "34.101.50.144/28"
        },
        {
            "ipv4Prefix": "34.118.254.0/28"
        },
[...]
        {
            "ipv4Prefix": "66.249.79.32/27"
        },
        {
            "ipv4Prefix": "66.249.79.64/27"
        },
        {
            "ipv4Prefix": "66.249.79.96/27"
        }
    ]
}

Here I cut out most of it, leaving only a few items for the sake of the example.

So this is what our list looks like; now we want to make a space-separated list out of it.

In the first round, let's download the list with the wget command:

wget -q -O- https://developers.google.com/static/search/apis/ipranges/googlebot.json

Then the JSON content we previously saw in the browser is printed in the terminal. The command's parameters:

  • -q: quiet mode. We get no program messages, only the downloaded content itself.
  • -O- (capital letter O): the name of the output file in which the downloaded content is saved. If "-" is given instead of a filename, the content is written to standard output.

So now we have the complete JSON output; next we need to convert it into the right form.

JSON data can be processed most efficiently with the jq command. This package is not installed by default on Debian/Ubuntu systems, so we need to install it with the following command:

sudo apt-get install jq
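To quickly verify that jq works, we can feed it a tiny inline JSON object in the same shape as Google's list (the address range below is a made-up documentation range, not a real Google prefix):

```shell
# A one-element sample shaped like Google's list (made-up range):
echo '{"prefixes":[{"ipv4Prefix":"192.0.2.0/24"}]}' | jq -r '.prefixes[0].ipv4Prefix'
# Prints: 192.0.2.0/24
```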

Once it is installed, we can combine it with the previous command using a pipeline:

wget -q -O- https://developers.google.com/static/search/apis/ipranges/googlebot.json | jq -r '.prefixes | .[] | .ipv4Prefix'

What happens here is that we pipe the text output of the previously downloaded JSON object to the input of the jq command, which handles the received content as a JSON object if it is syntactically correct. We also pass it a few options:

  • -r: with this option, if the result of the filter is a string, it is written directly to standard output rather than as a quoted JSON string. This is useful when jq filters have to communicate with non-JSON-based systems. Simply put: it strips the quotation marks.
  • '.prefixes | .[] | .ipv4Prefix': the filter command. For this particular JSON data structure, it selects the ".prefixes" field, then the array elements within it, and then the ".ipv4Prefix" field of each element. So this filter command is tailored to exactly this structure.
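We can also try the filter offline, without downloading anything, on a miniature JSON in the same shape as Google's list (the addresses below are made-up documentation ranges). Note how the IPv6 element yields a null:

```shell
# Miniature sample with one IPv6 and two IPv4 entries (made-up values):
json='{"prefixes":[{"ipv6Prefix":"2001:db8::/32"},{"ipv4Prefix":"192.0.2.0/24"},{"ipv4Prefix":"198.51.100.0/24"}]}'

echo "$json" | jq -r '.prefixes | .[] | .ipv4Prefix'
# Prints:
# null
# 192.0.2.0/24
# 198.51.100.0/24
```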

The output of the command is as follows:

[...]
null
null
null
null
null
34.100.182.96/28
34.101.50.144/28
34.118.254.0/28
34.118.66.0/28
34.126.178.96/28
[...]
66.249.79.128/27
66.249.79.160/27
66.249.79.192/27
[...]

Here, too, I cut off the beginning and the end (at the very end, Google's well-known IP addresses starting with 66.249 also appear). The point is that, as a result of the command, we get a list with one item per line. We get the null values because the last element of our filter command, ".ipv4Prefix", only returns IPv4 fields; in every other case where this field name cannot be applied, such as the "ipv6Prefix" fields, it returns null. In the next step we need to remove these null values.

The null values can be filtered out in several ways.

If we don't want to be sophisticated, just efficient and to the point, we can use another powerful shell tool: the grep command with its -v switch:

wget -q -O- https://developers.google.com/static/search/apis/ipranges/googlebot.json | jq -r '.prefixes | .[] | .ipv4Prefix' | grep -v "null"

As a result, we get the same output as above, except that the "null" lines are picked out by grep. In this case, grep simply treats the "null" values as character strings.

But if we want to be elegant, the null values can also be filtered out using jq's own tools:

wget -q -O- https://developers.google.com/static/search/apis/ipranges/googlebot.json | jq -r '.prefixes[] | .ipv4Prefix as $ip | if $ip != null then $ip else empty end'

This implementation is of course more complicated, but it solves the problem elegantly with jq's if-then-else construct. Here we must make sure that the null value is not interpreted as a character string, so it must not be enclosed in quotation marks.

We can use either version; the end result is the same: the null values are removed and only the IPv4 addresses and CIDR ranges remain.
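On a miniature sample we can check that the two variants really give the same result (the ranges are again made-up documentation addresses):

```shell
json='{"prefixes":[{"ipv6Prefix":"2001:db8::/32"},{"ipv4Prefix":"192.0.2.0/24"}]}'

# Variant 1: grep treats "null" as a plain string and drops those lines.
a=$(echo "$json" | jq -r '.prefixes | .[] | .ipv4Prefix' | grep -v "null")

# Variant 2: jq's own if-then-else replaces null with empty.
b=$(echo "$json" | jq -r '.prefixes[] | .ipv4Prefix as $ip | if $ip != null then $ip else empty end')

[ "$a" = "$b" ] && echo "identical: $a"
# Prints: identical: 192.0.2.0/24
```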

Since it makes for a shorter command, I will proceed with the first, grep -v method.

If we've made it this far, all that remains is to convert this list into a single space-separated line. For this task the paste command is the best choice, so let's extend our previous command (using the grep version here):

wget -q -O- https://developers.google.com/static/search/apis/ipranges/googlebot.json | jq -r '.prefixes | .[] | .ipv4Prefix' | grep -v "null" | paste -s -d " " -

Here is what happens; let's look at the switches of the paste command:

  • -s: serial: instead of pasting corresponding lines of multiple files side by side, it joins all the lines of a single file/input into one line.
  • -d: the delimiter character, in this case a space " ", because we need the items separated by spaces.
  • -: according to a Linux convention (which not all programs follow), a single dash in place of a filename refers to standard input (stdin) or standard output (stdout), depending on the context. Since here it stands in the place of the input filename, it refers to stdin; this is how we instruct the paste command to read the content arriving through the pipeline. We have already met this notation with wget's -O- option above.
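The behavior of paste -s is easy to try on its own (the ranges below are made-up documentation addresses):

```shell
# Three lines in, one space-separated line out:
printf '192.0.2.0/24\n198.51.100.0/24\n203.0.113.0/24\n' | paste -s -d " " -
# Prints: 192.0.2.0/24 198.51.100.0/24 203.0.113.0/24
```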

The output of the command is our list joined into one line:

(Screenshot: Processing of IP addresses - Enumeration converted from JSON)

So we get a nice one-line listing, which is now only broken into several lines by the terminal window.

Copy this list and put it in the "ignoreip" line as described at the beginning of the chapter:

ignoreip = 34.100.182.96/28 34.101.50.144/28 34.118.254.0/28 34.118.66.0/28 34.126.178.96/28 34.146.150.144/28 34.147.110.144/28 34.151.74.144/28 34.152.50.64/28 34.154.114.144/28 34.155.98.32/28 34.165.18.176/28 34.175.160.64/28 34.176.130.16/28 34.22.85.0/27 34.64.82.64/28 34.65.242.112/28 34.80.50.80/28 34.88.194.0/28 34.89.10.80/28 34.89.198.80/28 34.96.162.48/28 35.247.243.240/28 66.249.64.0/27 66.249.64.128/27 66.249.64.160/27 66.249.64.192/27 66.249.64.224/27 66.249.64.32/27 66.249.64.64/27 66.249.64.96/27 66.249.65.0/27 66.249.65.128/27 66.249.65.160/27 66.249.65.192/27 66.249.65.224/27 66.249.65.32/27 66.249.65.64/27 66.249.65.96/27 66.249.66.0/27 66.249.66.128/27 66.249.66.192/27 66.249.66.32/27 66.249.66.64/27 66.249.68.0/27 66.249.68.32/27 66.249.68.64/27 66.249.69.0/27 66.249.69.128/27 66.249.69.160/27 66.249.69.192/27 66.249.69.224/27 66.249.69.32/27 66.249.69.64/27 66.249.69.96/27 66.249.70.0/27 66.249.70.128/27 66.249.70.160/27 66.249.70.192/27 66.249.70.224/27 66.249.70.32/27 66.249.70.64/27 66.249.70.96/27 66.249.71.0/27 66.249.71.128/27 66.249.71.160/27 66.249.71.192/27 66.249.71.32/27 66.249.71.64/27 66.249.71.96/27 66.249.72.0/27 66.249.72.128/27 66.249.72.160/27 66.249.72.192/27 66.249.72.224/27 66.249.72.32/27 66.249.72.64/27 66.249.72.96/27 66.249.73.0/27 66.249.73.128/27 66.249.73.160/27 66.249.73.192/27 66.249.73.224/27 66.249.73.32/27 66.249.73.64/27 66.249.73.96/27 66.249.74.0/27 66.249.74.128/27 66.249.74.32/27 66.249.74.64/27 66.249.74.96/27 66.249.75.0/27 66.249.75.128/27 66.249.75.160/27 66.249.75.192/27 66.249.75.224/27 66.249.75.32/27 66.249.75.64/27 66.249.75.96/27 66.249.76.0/27 66.249.76.128/27 66.249.76.160/27 66.249.76.192/27 66.249.76.224/27 66.249.76.32/27 66.249.76.64/27 66.249.76.96/27 66.249.77.0/27 66.249.77.128/27 66.249.77.32/27 66.249.77.64/27 66.249.77.96/27 66.249.79.0/27 66.249.79.128/27 66.249.79.160/27 66.249.79.192/27 66.249.79.224/27 66.249.79.32/27 66.249.79.64/27 66.249.79.96/27

And if we have other IP address groups that we do not want to ban, we either append them after these, or put them on a new line using the "\" line-continuation notation shown above.

In the same way, we can convert and use other providers' IP address lists, if we need them.

The Amazon AWS JSON array does not use the "ipv4Prefix" field but the "ip_prefix" field for IPv4 addresses. Accordingly, for the Amazon JSON array we can use the following processing command:

wget -q -O- https://ip-ranges.amazonaws.com/ip-ranges.json | jq -r '.prefixes | .[] | .ip_prefix' | grep -v "null" | paste -s -d " " -

I won't post the output here, because there are a lot of IP addresses.
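As far as I know, each entry in the AWS list also carries "region" and "service" fields, so jq's select() can narrow the output to a single service if we don't want the full list. A sketch on an offline, AWS-shaped sample (the values are made up):

```shell
# Miniature sample in the shape of Amazon's ip-ranges.json (made-up values):
json='{"prefixes":[{"ip_prefix":"192.0.2.0/24","service":"CLOUDFRONT"},{"ip_prefix":"198.51.100.0/24","service":"EC2"}]}'

# Keep only the CLOUDFRONT prefixes:
echo "$json" | jq -r '.prefixes[] | select(.service == "CLOUDFRONT") | .ip_prefix'
# Prints: 192.0.2.0/24
```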

With this, the settings are complete. Once everything is in place, restart the Fail2Ban service:

systemctl restart fail2ban

Of course, in the meantime, check the result in the /var/log/fail2ban.log file to make sure that all jails have started, the previously banned IP addresses have been re-applied (Restore Ban xxx.xxx.xxx.xxx), and so on.

 

 

Advanced

For those who are in a hurry, or advanced users who don't want to read through the theoretical parts, here is a very short extract, a summary of the description above, containing only the commands and settings necessary to complete the task. However, if any of these parts are not understandable, or the settings and instructions do not work, it is recommended to read the description from the beginning. Now come the settings, in short!
Attention!
In the following sections I assume that Fail2Ban is already familiar, that we have installed it on our system and used it before, so only a few small additional settings are needed here and the briefly described settings are understood. If we do not understand them, let's not implement them; instead, read the full description with the theoretical parts above!

There are two ways to set up the Fail2Ban system below.

Search for IP addresses at the beginning of lines

If our Apache configuration produces log lines that start with the clients' IP addresses, the following settings apply:

Filter file setting:

nano /etc/fail2ban/filter.d/apache-4xx.conf

And let's put in the following content:

[Definition]
# Searching for IP addresses at the beginning of lines:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$

# Exceptions:
ignoreregex = .*(robots\.txt|favicon\.ico)

Let's save it.
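Before restarting anything, we can sanity-check the idea behind the failregex on a sample log line with plain grep. Fail2Ban's <HOST> token is not ordinary regex, so in this sketch I substitute a simple IPv4 pattern for it (the log line and IP address are made up):

```shell
# A made-up access-log line in "combined" style, starting with the client IP:
line='203.0.113.7 - - [14/Mar/2023:00:38:00 +0100] "GET /wp-login.php HTTP/1.1" 404 196 "-" "Mozilla/5.0"'

# The failregex with <HOST> replaced by a plain IPv4 pattern:
pattern='^[0-9]{1,3}(\.[0-9]{1,3}){3} -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$'

echo "$line" | grep -Eq "$pattern" && echo "match" || echo "no match"
# Prints: match
```

For a real test, Fail2Ban's own fail2ban-regex tool can run the actual filter file against a log, e.g. fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-4xx.conf.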

Jail Setup:

nano /etc/fail2ban/jail.local

Add / append the following content to the end:

[apache-4xx]
enabled = true
port = http,https,8080,8081
filter = apache-4xx
logpath  = /var/log/ispconfig/httpd/*/access.log
findtime = 60
maxretry = 10
banaction = %(known/banaction)s[blocktype=DROP]
bantime = 3600
ignoreip =

Search for IP addresses inside lines

If our Apache configuration produces log lines that do not start with the clients' IP addresses but contain them further inside the line, then the following settings apply:

Filter file setting:

nano /etc/fail2ban/filter.d/apache-4xx.conf

Include the following content:

[Definition]
# Searching for IP addresses inside the lines:
failregex = .* <HOST> -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$

# Exceptions:
ignoreregex = .*(robots\.txt|favicon\.ico)

Let's save it.

Jail Setup:

nano /etc/fail2ban/jail.local

Add / append the following content to the end:

[apache-4xx]
enabled = true
port = http,https,8080,8081
filter = apache-4xx
logpath  = %(apache_access_log)s
findtime = 60
maxretry = 10
banaction = %(known/banaction)s[blocktype=DROP]
bantime = 3600
ignoreip =

Let's save it.
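The same kind of grep sanity check works for this variant too; here the IP sits inside the line (for example behind a vhost prefix), so the pattern no longer anchors the address to the start of the line (the log line and addresses are made up):

```shell
# Made-up log line with a vhost field before the client IP:
line='example.com:443 203.0.113.7 - - [14/Mar/2023:00:38:00 +0100] "POST /xmlrpc.php HTTP/1.1" 403 199 "-" "Mozilla/5.0"'

# The failregex with <HOST> replaced by a plain IPv4 pattern:
pattern='.* [0-9]{1,3}(\.[0-9]{1,3}){3} -.*"(GET|POST|HEAD).*" (400|401|403|404) .*$'

echo "$line" | grep -Eq "$pattern" && echo "match" || echo "no match"
# Prints: match
```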

 

Then, whichever variant we use, let's set the ignoreip parameter in the jail and include the IP addresses that we want to exclude from this check, i.e. that we never want the system to ban, no matter how many hits they cause.

Here I intentionally raised the value of maxretry to 10 compared to the detailed description, so that, in the absence of the theoretical parts, there is less chance of causing unwanted bans. Of course, fine-tune these settings to your own needs.

Once everything is set, restart Fail2Ban:

systemctl restart fail2ban

 

 

Conclusion

With these settings we can block requests that generate 4xx HTTP response codes in large quantities. Due to the length of the description, I also created an advanced section with the settings described concisely, so everyone can use the part of the description that suits them.

However, we must pay sufficient attention to fine-tuning the jail, because a poorly configured jail can ban many visitors that we never intended to keep away from our websites.