
Killing Bots at the Gate: Detecting Malicious Crawlers with Nginx


May 3, 2025 - 17:38

Bots are a fact of life on the internet.

Some are helpful—like search engine crawlers.

Others scrape your data, spam your forms, or brute-force your login pages.

If you’re self-hosting with Nginx, you don’t need a pricey SaaS WAF to stop them.

Here's how to detect and destroy malicious bots using good ol’ Nginx, a few scripts, and some zip-bomb flavor.

1. Start with Logs — Always

Nginx logs tell the full story. Make sure you're capturing User-Agent, IP, and paths.

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent"';
access_log  /var/log/nginx/access.log  main;
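For anything beyond one-liners, it helps to turn each log line into named fields. The sketch below assumes the exact log_format shown above; the regex group names and the sample line are illustrative, not taken from a real log.

```python
import re

# Regex mirroring the log_format above. Group names are illustrative.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    return fields


# Made-up sample line in the format above.
sample = ('203.0.113.7 - - [03/May/2025:17:38:00 +0000] '
          '"GET /wp-login.php HTTP/1.1" 404 153 "-" "curl/8.5.0"')
entry = parse_line(sample)
print(entry["ip"], entry["status"], entry["user_agent"])
```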

Now dig through logs for patterns:

# Top IPs by request volume
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head

# Suspicious User-Agents
grep -iE 'curl|wget|python|scrapy|bot|crawler|headless' /var/log/nginx/access.log | less
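If you would rather keep both checks in one script, here is a rough Python equivalent of the two commands above. It assumes the log format from step 1, where the User-Agent is the last quoted field. Unlike the grep, it only tests the User-Agent rather than the whole line. The sample lines are invented.

```python
import re
from collections import Counter

# Same keywords as the grep above, matched case-insensitively.
SUSPICIOUS = re.compile(r'curl|wget|python|scrapy|bot|crawler|headless',
                        re.IGNORECASE)


def summarize(lines, top=10):
    """Count requests per IP and occurrences of suspicious User-Agents."""
    ip_counts = Counter()
    flagged_agents = Counter()
    for line in lines:
        if not line.strip():
            continue
        ip_counts[line.split(" ", 1)[0]] += 1
        # With the format from step 1, the User-Agent is the last quoted field.
        parts = line.split('"')
        user_agent = parts[-2] if len(parts) >= 3 else ""
        if SUSPICIOUS.search(user_agent):
            flagged_agents[user_agent] += 1
    return ip_counts.most_common(top), flagged_agents.most_common(top)


# Invented sample; in practice pass an open file object instead.
sample_lines = [
    '203.0.113.7 - - [03/May/2025:17:38:00 +0000] "GET /a HTTP/1.1" 200 10 "-" "curl/8.5.0"',
    '203.0.113.7 - - [03/May/2025:17:38:01 +0000] "GET /b HTTP/1.1" 200 10 "-" "curl/8.5.0"',
    '198.51.100.4 - - [03/May/2025:17:38:02 +0000] "GET / HTTP/1.1" 200 10 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
top_ips, top_agents = summarize(sample_lines)
print(top_ips)
print(top_agents)
```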

Want real-time views? Try GoAccess for a terminal dashboard.

2. Identify Suspicious Behavior

Things that scream “bot”:

Blank or obviously fake User-Agent headers
High request volume from a single IP
Frequent hits to /wp-login.php, /xmlrpc.php, /admin, or random paths
Unusual Referer headers or none at all
Crawlers hitting endpoints that no normal user would
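These signals can be combined into a crude score per request. The sketch below is only an illustration: the weights, the volume threshold, and the function name are assumptions, and you would tune them against your own traffic.

```python
# Paths from the list above that ordinary visitors rarely request.
PROBE_PATHS = ("/wp-login.php", "/xmlrpc.php", "/admin")


def suspicion_score(path, user_agent, referer, requests_last_minute,
                    volume_threshold=120):
    """Add up hypothetical weights for each bot-like signal."""
    score = 0
    if not user_agent or user_agent == "-":
        score += 2  # blank User-Agent (Nginx logs an empty header as "-")
    if requests_last_minute > volume_threshold:
        score += 2  # high request volume from a single IP
    if path.startswith(PROBE_PATHS):
        score += 2  # probing well-known login or admin endpoints
    if not referer or referer == "-":
        score += 1  # missing Referer is a weak signal on its own
    return score


print(suspicion_score("/wp-login.php", "-", "-", 500))
print(suspicion_score("/blog/post", "Mozilla/5.0", "https://example.com/", 3))
```

A missing Referer gets the lowest weight because direct visits and privacy-focused browsers produce it too.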

Bonus: check your logs against public bot signature lists like MitchellKrogza’s bad bot list.

3. Block the Obvious Stuff with Nginx

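Create a quick and dirty User-Agent filter. The map block goes in the http context. Be aware that the bare bot keyword also matches legitimate crawlers such as Googlebot, so narrow it if you depend on search traffic: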

map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|bot|Go-http-client) 1;
}

server {
    if ($bad_bot) {
        return 403;
    }
}
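Before deploying a pattern like this, it is worth checking what it actually catches. The sketch below replays the same alternation with Python's re module as a stand-in for Nginx's regex engine, which should behave the same for a pattern this simple. The User-Agent strings are examples only.

```python
import re

# Same alternation as the map block; re.IGNORECASE mirrors the ~* operator.
BAD_BOT = re.compile(r'(curl|wget|python|scrapy|bot|Go-http-client)',
                     re.IGNORECASE)


def is_bad_bot(user_agent):
    """Return True if the map above would set $bad_bot to 1."""
    return BAD_BOT.search(user_agent) is not None


examples = [
    "curl/8.5.0",
    "python-requests/2.31.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "",
]
for user_agent in examples:
    print(repr(user_agent), is_bad_bot(user_agent))
```

Two things stand out: the Googlebot string is blocked because it contains "bot", and an empty User-Agent slips through because none of the keywords match it.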

And rate limit abusive IPs:

limit_req_zone $binary_remote_addr zone=abusers:10m rate=5r/s;

server {
    location / {
        limit_req zone=abusers burst=10 nodelay;
        ...
    }
}
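To get a feel for what rate=5r/s with burst=10 nodelay allows, here is a simplified leaky-bucket model in Python. It is a sketch of the idea, not Nginx's actual implementation, and the class and method names are made up for the illustration.

```python
class LeakyBucket:
    """Simplified model of one client's state in a limit_req zone."""

    def __init__(self, rate, burst):
        self.rate = rate    # requests drained per second
        self.burst = burst  # excess requests tolerated above the rate
        self.excess = 0.0
        self.last = None

    def allow(self, now):
        """Return True if a request arriving at time `now` is accepted."""
        if self.last is None:
            # First request from this client: nothing is queued yet.
            self.last = now
            return True
        # Drain the bucket for the elapsed time, then add this request.
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            return False  # rejected requests do not change the state
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate=5, burst=10)
# Twenty requests arriving at the same instant.
results = [bucket.allow(0.0) for _ in range(20)]
print(sum(results), "of 20 accepted")
# One second later, five slots have drained.
print(bucket.allow(1.0))
```

In this model a client can send one request plus the burst of ten at once; everything beyond that is rejected until the bucket drains at five requests per second.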

Also check out Nginx rate limiting docs.

4. Use Fail2Ban to Auto-Ban IPs

Install Fail2Ban and wire it to your Nginx logs:

Jail config (/etc/fail2ban/jail.local):

[nginx-badbots]
enabled  = true
filter   = nginx-badbots
logpath  = /var/log/nginx/access.log
maxretry = 5
findtime = 600
bantime  = 3600

Filter (/etc/fail2ban/filter.d/nginx-badbots.conf):

[Definition]
failregex = ^<HOST> -.*"(GET|POST).*HTTP.*"(curl|wget|python|scrapy|bot|Go-http-client)
ignoreregex =
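Fail2Ban's own fail2ban-regex utility is the authoritative way to test a filter against a log file. For a quick look at which lines a pattern like this catches, the sketch below assumes the failregex starts with Fail2Ban's <HOST> placeholder and replaces it with a simplified capture group. The real expansion is stricter, and the log lines are invented.

```python
import re

FAILREGEX = (r'^<HOST> -.*"(GET|POST).*HTTP.*"'
             r'(curl|wget|python|scrapy|bot|Go-http-client)')

# Simplified stand-in for Fail2Ban's <HOST> expansion (an assumption:
# the real one validates IPv4/IPv6 addresses and hostnames).
PATTERN = re.compile(FAILREGEX.replace("<HOST>", r"(?P<host>\S+)"))


def offending_host(line):
    """Return the client address if the line matches, else None."""
    match = PATTERN.search(line)
    return match.group("host") if match else None


curl_line = ('203.0.113.7 - - [03/May/2025:17:38:00 +0000] '
             '"GET /wp-login.php HTTP/1.1" 404 153 "-" "curl/8.5.0"')
googlebot_line = ('198.51.100.4 - - [03/May/2025:17:38:02 +0000] '
                  '"GET / HTTP/1.1" 200 512 "-" '
                  '"Mozilla/5.0 (compatible; Googlebot/2.1)"')
print(offending_host(curl_line))
print(offending_host(googlebot_line))
```

Because the keyword group sits directly after a double quote, this pattern only fires when a quoted field, in practice the User-Agent, begins with one of the keywords. It is case-sensitive as written, so it is narrower than the Nginx map in step 3.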

Once this is running, any IP whose requests match the filter 5 times (maxretry) within 600 seconds (findtime) is banned for 3600 seconds (bantime).
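The interplay of maxretry, findtime, and bantime is easier to see in a toy model. The sketch below is a simplified simulation of that policy, not Fail2Ban's code. The class name and methods are hypothetical, and the defaults copy the jail config above.

```python
from collections import defaultdict, deque


class BanTracker:
    """Toy model: ban an IP after maxretry hits within findtime seconds."""

    def __init__(self, maxretry=5, findtime=600, bantime=3600):
        self.maxretry = maxretry
        self.findtime = findtime
        self.bantime = bantime
        self.hits = defaultdict(deque)  # ip -> timestamps of recent matches
        self.banned_until = {}          # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def record_hit(self, ip, now):
        """Record a matching log line; return True if it triggers a ban."""
        if self.is_banned(ip, now):
            return False
        window = self.hits[ip]
        window.append(now)
        # Forget matches that fall outside the findtime window.
        while window and now - window[0] > self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned_until[ip] = now + self.bantime
            window.clear()
            return True
        return False


tracker = BanTracker()
fast = [tracker.record_hit("203.0.113.7", t) for t in (0, 10, 20, 30, 40)]
slow = [tracker.record_hit("198.51.100.4", t) for t in (0, 700, 1400, 2100, 2800)]
print(fast, tracker.is_banned("203.0.113.7", 41))
print(slow, tracker.is_banned("198.51.100.4", 2801))
```

The slow client never gets banned because its hits are spread wider than findtime, which is why patient scrapers need the tools in the next step.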

5. Use Better Tools for Smarter Bots

If you're seeing more sophisticated attacks, try:

CrowdSec: Open-source tool that shares a dynamic IP reputation list and applies bans
ModSecurity: Full WAF, works with Nginx
OpenResty: Extend Nginx with Lua scripting (e.g., custom captcha, behavior analysis)

If you’re open to a proxy layer:

Cloudflare free tier: Blocks a lot of trash automatically
Fastly Bot Protection: Advanced but paid

Bonus: Serve Zip Bombs to Dumb Bots (⚠️ Handle with care)

This blog post by Idiallo shows how he turned bot detection into punishment.

The method? Serve them a compressed zip bomb.

To generate one:

dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz

This creates a ~10MB file that decompresses to 10GB of zeros.
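You can sanity-check the ratio on a smaller scale without writing 10GB anywhere. This sketch compresses 10 MiB of zeros in memory with Python's gzip module. The sample size is arbitrary, and the exact compressed size may vary slightly between zlib versions.

```python
import gzip


def zero_ratio(size_bytes):
    """Compress size_bytes of zeros; return (compressed size, ratio)."""
    raw = bytes(size_bytes)  # bytes(n) yields n zero bytes
    compressed = gzip.compress(raw)
    return len(compressed), size_bytes / len(compressed)


compressed_size, ratio = zero_ratio(10 * 1024 * 1024)
print(f"{compressed_size} bytes compressed, ratio about {ratio:.0f}:1")
```

A ratio in the neighborhood of 1000:1 is what turns 10GB of zeros into a file of roughly 10MB.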

A bot that decompresses the response without enforcing a size limit tries to expand all 10GB, and typically runs out of memory or disk and crashes.

Then detect and serve it:

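// ipIsBlackListed() and isMalicious() stand in for your own detection logic;
// ZIP_BOMB_FILE_10G is a constant holding the path to the 10GB.gz file.
if (ipIsBlackListed() || isMalicious()) {
    // Declare the body as compressed so the client tries to inflate it.
    header("Content-Encoding: deflate, gzip");
    header("Content-Length: " . filesize(ZIP_BOMB_FILE_10G));
    // Send the pre-built file as-is, then stop processing the request.
    readfile(ZIP_BOMB_FILE_10G);
    exit;
}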