mirror of https://github.com/fabriziosalmi/patterns.git synced 2026-06-11 06:54:15 -04:00


Fabrizio Salmi 5c654b3da8 Redesign docs with Apple-native theme; verify content; route CI to self-hosted runner-02

- VitePress: custom theme (SF system fonts, glass nav, soft surfaces, pill buttons,
  light/dark code blocks, refined feature cards, platform showcase + stat strip).
- Replace every emoji across docs and README with inline SVG icons.
- Verify and fix doc accuracy against actual scripts: JSON schema (category+pattern only),
  env-var configuration for json2*/import_* scripts, owasp2json CLI surface.
- Add public assets (logo.svg, favicon.svg, hero-shield.svg) and Shiki haproxy alias.
- Workflows default to self-hosted runner-02 with a configurable fallback to GitHub
  runners via the RUNS_ON repo variable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 08:07:04 +02:00


Bad Bot Detection

badbots.py generates per-platform User-Agent blocklists alongside the OWASP rules, so you can drop noisy crawlers, AI scrapers, and known abusive scanners in a single include.

How it works

The script fetches public bot lists — including ai.robots.txt and other community-curated sources.
It deduplicates and normalizes the User-Agent patterns.
It emits one file per platform under waf_patterns/<platform>/.
The daily GitHub Actions workflow regenerates and republishes these files alongside the OWASP-derived rules.

If a primary source is unreachable, the script falls back to a bundled list so the build still succeeds.
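The fetch, normalize, and fallback steps above can be sketched roughly as follows. This is a minimal illustration, not the actual code in badbots.py: the names `FALLBACK_BOTS`, `normalize`, and `collect` are hypothetical, and the sketch assumes the fallback applies when no source could be read at all.

```python
from typing import Callable, Iterable

# Hypothetical bundled fallback; the list shipped with badbots.py may differ.
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot"]


def normalize(patterns: Iterable[str]) -> list[str]:
    """Trim whitespace, skip blanks and comments, dedupe case-insensitively."""
    seen: set[str] = set()
    result: list[str] = []
    for raw in patterns:
        name = raw.strip()
        if not name or name.startswith("#"):
            continue
        key = name.lower()
        if key not in seen:
            seen.add(key)
            result.append(name)
    return sorted(result, key=str.lower)


def collect(fetchers: Iterable[Callable[[], Iterable[str]]]) -> list[str]:
    """Gather patterns from every reachable source, else use the fallback."""
    gathered: list[str] = []
    for fetch in fetchers:
        try:
            gathered.extend(fetch())
        except OSError:
            # Network failures (including urllib's URLError) are OSError subclasses.
            continue
    if not gathered:
        gathered = list(FALLBACK_BOTS)
    return normalize(gathered)
```

Passing the fetchers in as callables keeps the sketch testable without network access; a real fetcher would download and split a remote list.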

Generated files

| Platform | File | Format |
| --- | --- | --- |
| Nginx | `bots.conf` | `map $http_user_agent $bad_bot` |
| Apache | `bots.conf` | ModSecurity `SecRule` directives |
| Traefik | `bots.toml` | Middleware regex replacements |
| HAProxy | `bots.acl` | One regex per line, loadable with `-f` |

Nginx

# In the http block:
include /etc/nginx/waf_patterns/nginx/bots.conf;

# In any server block you want to protect:
server {
    if ($bad_bot) { return 403; }
}

The map looks like:

map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"   1;
    "~*SemrushBot"  1;
    "~*MJ12bot"     1;
    "~*GPTBot"      1;
    # …
}
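For illustration, a map of this shape could be rendered from a pattern list along these lines. The function name `render_nginx_map` is hypothetical and the real generator may format its output differently; the sketch also assumes each pattern is already a valid regex fragment that contains no double quotes.

```python
def render_nginx_map(patterns: list[str]) -> str:
    """Render an Nginx map that sets $bad_bot to 1 for matching User-Agents."""
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    for pattern in patterns:
        # "~*" marks a case-insensitive regular expression in an Nginx map.
        lines.append(f'    "~*{pattern}" 1;')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_nginx_map(["AhrefsBot", "GPTBot"]), end="")
```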

Apache

SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"

Include the file globally or per VirtualHost:

Include /etc/apache2/waf_patterns/apache/bots.conf
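As a rough sketch of how one rule per pattern might be produced, the snippet below emits `SecRule` lines in the shape shown above. The helper `render_secrules` is hypothetical, and numbering rules sequentially from 200001 is an assumption based only on the single example id; the real script may allocate ids differently.

```python
def render_secrules(patterns: list[str], first_id: int = 200001) -> str:
    """Emit one ModSecurity rule per pattern, with consecutive rule ids."""
    rules = []
    for offset, pattern in enumerate(patterns):
        rules.append(
            f'SecRule REQUEST_HEADERS:User-Agent "@rx {pattern}" \\\n'
            f"    \"id:{first_id + offset},phase:1,deny,status:403,"
            f"msg:'Bad Bot Blocked'\""
        )
    return "\n\n".join(rules) + "\n"


if __name__ == "__main__":
    print(render_secrules(["AhrefsBot", "SemrushBot"]), end="")
```

ModSecurity requires every rule id to be unique, which is why the sketch derives each id from the pattern's position.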

HAProxy

acl bad_bot hdr(User-Agent) -m reg -i -f /etc/haproxy/bots.acl
http-request deny deny_status 403 if bad_bot
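To check a pattern file before deploying it, the ACL's behaviour can be approximated offline. The sketch below mimics `-m reg -i -f`: one case-insensitive regex per line, matched anywhere in the header value, with empty lines and lines starting with `#` ignored. The function names are hypothetical, and Python's `re` module is only an approximation of the regex engine HAProxy is built with.

```python
import re


def load_acl(text: str) -> list[re.Pattern]:
    """Compile one case-insensitive regex per usable line of a pattern file."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are skipped
        patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_bad_bot(user_agent: str, acl: list[re.Pattern]) -> bool:
    """Return True if any pattern matches somewhere in the User-Agent."""
    return any(pattern.search(user_agent) for pattern in acl)
```

For example, `is_bad_bot(ua, load_acl(open("bots.acl").read()))` reports whether a given User-Agent string would be denied under these assumptions.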

Traefik

[http.middlewares.bot-blocker]
  # populated automatically by bots.toml

Reference bot-blocker@file from the routers you want to protect.

What gets blocked

The default list groups User-Agent patterns into four broad categories.

SEO and marketing crawlers

Aggressive site indexers that are usually unwelcome on production traffic:

AhrefsBot
SemrushBot
MJ12bot
DotBot
BLEXBot

AI training crawlers

Most are documented at ai.robots.txt:

GPTBot, ChatGPT-User
ClaudeBot, Anthropic-AI
Google-Extended
CCBot, Bytespider, PerplexityBot

General scrapers

DataForSeoBot
PetalBot
Bytespider

Malicious scanners

Public vulnerability scanners and spam bots that have no legitimate reason to crawl your origin.

::: tip Search engines are not blocked
Major search engines (Googlebot, Bingbot, DuckDuckBot, Baiduspider, YandexBot) are not included in the default block list, because blocking them harms SEO.
:::

Customization

Add your own pattern

# Nginx: add inside the map block in bots.conf
"~*MyCustomBot" 1;

# Apache (ModSecurity): append to bots.conf with an unused rule id
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"

Whitelist a bot

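For Nginx, put the allow entry before any blocking pattern that could match the same User-Agent. When several regular expressions match, Nginx uses the first one in order of appearance: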

map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot"   0;   # explicit allow
    "~*AhrefsBot"   1;
}
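Entry order matters because, among regular-expression entries, an Nginx map uses the first one that matches in order of appearance. The sketch below imitates that lookup so an allow rule can be tested offline. `evaluate_map` and the broad `bot` pattern are hypothetical and used only to show the effect of ordering.

```python
import re


def evaluate_map(user_agent: str, entries: list[tuple[str, str]],
                 default: str = "0") -> str:
    """Return the value of the first case-insensitive regex that matches."""
    for pattern, value in entries:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return value
    return default


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1)"

# Allow entry first: the explicit allow wins over the broader block pattern.
allow_first = [("Googlebot", "0"), ("bot", "1")]
# Block entry first: the broad pattern matches before the allow is reached.
block_first = [("bot", "1"), ("Googlebot", "0")]

print(evaluate_map(googlebot, allow_first))
print(evaluate_map(googlebot, block_first))
```

With a list that contains only specific bot names, the allow entry is mostly defensive, but it protects against a broader pattern being added later.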

Allow bots inside a path

location /public-api/ {
    # bypass the bot rule for this path
    proxy_pass http://upstream;
}

location / {
    if ($bad_bot) { return 403; }
    proxy_pass http://upstream;
}

Regenerating manually

python badbots.py

The generated files end up in waf_patterns/<platform>/.

Monitoring

Track which patterns actually fire in your traffic:

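# Top 20 user agents that hit a 403 (assumes the default "combined" log format)
# Splitting on double quotes makes $3 the status/bytes field and $6 the full User-Agent.
awk -F'"' '$3 ~ /^ 403 / {print $6}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20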

If you see legitimate traffic in the list, add it to a whitelist and re-include bots.conf after your override.
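The same report can be produced in Python, which is convenient when the result feeds another script. This is a minimal sketch that assumes the default `combined` log format; `LINE_RE` and `top_blocked_agents` are illustrative names, not part of the repository.

```python
import re
from collections import Counter
from typing import Iterable

# combined format: ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def top_blocked_agents(lines: Iterable[str], limit: int = 20) -> list[tuple[str, int]]:
    """Count full User-Agent strings on requests answered with 403."""
    counts: Counter[str] = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status") == "403":
            counts[match.group("ua")] += 1
    return counts.most_common(limit)
```

A typical call would be `with open("/var/log/nginx/access.log") as log: print(top_blocked_agents(log))`, with the path adjusted to your setup.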
