AI Bot Sentinel: Catching Misbehaving Crawlers in Real Time

Most websites have no idea who’s actually fetching their content. That is fine when “who” is just humans using browsers. It is increasingly not fine when “who” is a mix of AI training crawlers, retrieval crawlers, monitoring services, competitive scrapers, and the long tail of small bots, each of which warrants different treatment.

What “real-time” actually means

AI Bot Sentinel classifies every incoming request in under 5ms. That’s table stakes for production traffic: you can’t add 200ms of bot-classification time to every page load. The classifier runs at the edge (or close to it), returns a decision (allow, block, throttle, or honeypot), and the request continues.

The first time a new bot pattern is seen, it gets a more thorough analysis on the AI-assisted classification path. That takes ~50ms, but it runs off the critical path, and the result is cached so subsequent requests from the same bot are classified instantly.

What gets caught (with examples)

Typical traffic distribution on a mid-sized publisher:

Real humans: 60-70%. Allow through, no friction.
Known good crawlers (Googlebot, Bingbot, PerplexityBot): 8-15%. Allow.
Known AI training crawlers (GPTBot, ClaudeBot, etc.): 5-12%. Apply your licensing config.
Unknown but well-behaved: 3-8%. Log, surface in dashboard for review, allow by default.
Scrapers pretending to be browsers: 4-10%. Detect via TLS fingerprint, behaviour, header anomalies. Honeypot or block.
Scrapers pretending to be AI crawlers: 1-5%. Detect via IP-range mismatch. Block.
Genuinely malicious traffic (credential-stuffing attempts, vuln scans, etc.): 1-3%. Block, rate-limit, or honeypot depending on pattern.
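The category-to-action mapping above could be expressed as a small policy table. This is a sketch only: the `Category` names, the action strings, and the `action_for` helper are invented for illustration, and where the list offers a choice (for example "honeypot or block") one option is picked arbitrarily.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    HUMAN = "real human"
    GOOD_CRAWLER = "known good crawler"
    AI_TRAINING = "known AI training crawler"
    UNKNOWN_POLITE = "unknown but well-behaved"
    FAKE_BROWSER = "scraper pretending to be a browser"
    FAKE_AI_CRAWLER = "scraper pretending to be an AI crawler"
    MALICIOUS = "genuinely malicious"


# Hypothetical defaults mirroring the list above.
DEFAULT_ACTIONS = {
    Category.HUMAN: "allow",
    Category.GOOD_CRAWLER: "allow",
    Category.AI_TRAINING: "apply_licensing_config",
    Category.UNKNOWN_POLITE: "allow_and_log",
    Category.FAKE_BROWSER: "honeypot",   # the list says honeypot or block
    Category.FAKE_AI_CRAWLER: "block",
    Category.MALICIOUS: "block",         # or rate-limit/honeypot, per pattern
}


def action_for(category: Category,
               overrides: Optional[dict] = None) -> str:
    """Look up the action for a category, letting site config override it."""
    if overrides and category in overrides:
        return overrides[category]
    return DEFAULT_ACTIONS[category]
```

For example, `action_for(Category.FAKE_BROWSER, {Category.FAKE_BROWSER: "block"})` returns `"block"`, while the same call without overrides falls back to the default.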

The classification stack

Sentinel chains four detection signals, each cheap individually:

User-agent match against the curated database. First-pass classification.
IP-range verification for claimed AI crawlers. Catches ~80% of UA spoofing.
Behavioural fingerprint: request rate, header order, JS execution, cookie behaviour, TLS signature. Produces a confidence score.
AI classifier for unknowns. Gemini-backed, trained on millions of bot patterns. Used only when the first three signals don’t produce high-confidence classification.

Each stage can short-circuit the ones after it. Most requests are classified at stage 1 or 2 and never reach the AI classifier, which is where token cost is incurred.
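The short-circuiting chain can be sketched roughly as follows. This is an illustration under stated assumptions, not Sentinel's implementation: the `Request` and `Verdict` types, the thresholds, and the labels are hypothetical, the final stage is a stub standing in for the model-backed classifier, and the "published range" is a documentation-only address block (192.0.2.0/24), not any vendor's real range.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    user_agent: str
    ip: str
    behaviour_score: float = 0.5  # hypothetical 0..1 "looks automated" score


@dataclass
class Verdict:
    label: str
    stage: str


# Placeholder data. 192.0.2.0/24 is a documentation range, not a real one.
KNOWN_UAS = {"Googlebot": "known_good", "GPTBot": "ai_training"}
PUBLISHED_RANGES = {"GPTBot": [ipaddress.ip_network("192.0.2.0/24")]}


def ua_stage(req: Request) -> Optional[Verdict]:
    """Stage 1: user-agent match. Claimed AI crawlers defer to stage 2."""
    for token, label in KNOWN_UAS.items():
        if token in req.user_agent and token not in PUBLISHED_RANGES:
            return Verdict(label, "ua")
    return None


def ip_stage(req: Request) -> Optional[Verdict]:
    """Stage 2: does the source IP fall inside the claimed crawler's ranges?"""
    for token, networks in PUBLISHED_RANGES.items():
        if token in req.user_agent:
            addr = ipaddress.ip_address(req.ip)
            genuine = any(addr in net for net in networks)
            label = KNOWN_UAS[token] if genuine else "spoofed_ai_crawler"
            return Verdict(label, "ip")
    return None


def behaviour_stage(req: Request) -> Optional[Verdict]:
    """Stage 3: decide only when the behavioural score is confident."""
    if req.behaviour_score >= 0.9:
        return Verdict("scraper", "behaviour")
    if req.behaviour_score <= 0.1:
        return Verdict("human", "behaviour")
    return None


def ai_stage(req: Request) -> Optional[Verdict]:
    """Stage 4: stub for the model-backed classifier; always answers."""
    return Verdict("unknown", "ai")


STAGES = [ua_stage, ip_stage, behaviour_stage, ai_stage]


def classify(req: Request) -> Verdict:
    """Run stages in order and stop at the first one that returns a verdict."""
    for stage in STAGES:
        verdict = stage(req)
        if verdict is not None:
            return verdict
    raise RuntimeError("no stage produced a verdict")
```

Because each stage returns `None` when it cannot decide, the expensive final stage only runs for requests the first three could not settle.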

What you see in the dashboard

Live feed of incoming requests with classification, IP, UA, action taken.
Per-bot rollups over the last 24h / 7d / 30d.
Suggestions queue: bots the classifier wasn’t sure about — your decisions train the model.
Per-IP history: search any IP, see every visit, decision, and pattern.
Audit log: every rule change, who made it, when.

Real-world cases

Things AI Bot Sentinel has caught in customer deployments:

A “GPTBot” coming from an AWS IP not in OpenAI’s published range → scraper. Honeypotted.
A small AI startup’s crawler that was hitting the same article 500 times/minute (their dev had set retries wrong) → rate-limited with a friendly 429.
A competitor’s price-scraper rotating through residential IPs → caught via behavioural fingerprint. Served decoy pricing.
A research lab’s experimental crawler with a never-before-seen user-agent → classified as “unknown but legitimate”, surfaced to the operator who allow-listed it.
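The "friendly 429" in the second case can be illustrated with a per-client sliding-window limiter that also tells the client when to retry. This is a sketch, not Sentinel's rate limiter: the `SlidingWindowLimiter` class, its parameters, and the choice of a sliding window are all assumptions. The `429 Too Many Requests` status and the `Retry-After` header are standard HTTP.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                  # injectable so it can be tested
        self._hits = defaultdict(deque)     # client key -> request timestamps

    def check(self, key: str) -> tuple[int, dict[str, str]]:
        """Return (status, headers): 200 if allowed, 429 with Retry-After."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Seconds until the oldest hit expires, rounded up, at least 1.
            wait = max(1, math.ceil(self.window - (now - hits[0])))
            return 429, {"Retry-After": str(wait)}
        hits.append(now)
        return 200, {}
```

Keying the limiter on a crawler identity plus URL (rather than IP alone) would match the "same article 500 times/minute" pattern more closely; which key to use is a deployment decision.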