Most websites have no idea who’s actually fetching their content. That is fine when “who” means human users in browsers. It is increasingly not fine when “who” is a mix of AI training crawlers, retrieval crawlers, monitoring services, competitive scrapers, and the long tail of small bots, each of which warrants different treatment.
What “real-time” actually means
AI Bot Sentinel classifies every incoming request in under 5 ms. That’s table stakes for production traffic: you can’t add 200 ms of bot classification to every page load. The classifier runs at the edge (or close to it), returns a decision (allow, block, throttle, or honeypot), and the request continues.
The first time a new bot pattern is seen, it gets a more thorough analysis (the AI-assisted classification path), which takes ~50 ms. That analysis runs off the critical path and its result is cached, so subsequent requests from the same bot are answered from the cache.
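A minimal Python sketch of that cache-then-analyse pattern follows. It is an illustration, not Sentinel's implementation: the `DecisionCache` class, the `slow_classify` callback, and the choice to return a provisional "allow" while the analysis is still running are all assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DecisionCache:
    """Answer from cache; analyse unseen fingerprints in the background."""

    def __init__(self, slow_classify, default="allow"):
        self._slow_classify = slow_classify  # hypothetical ~50 ms analysis
        self._default = default              # assumed provisional decision
        self._decisions = {}
        self._pending = set()
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=2)

    def decide(self, fingerprint):
        with self._lock:
            if fingerprint in self._decisions:
                return self._decisions[fingerprint]  # fast path: cache hit
            if fingerprint not in self._pending:
                # Schedule the slow analysis once per new fingerprint.
                self._pending.add(fingerprint)
                self._pool.submit(self._analyse, fingerprint)
        # The request is not held up while the analysis runs.
        return self._default

    def _analyse(self, fingerprint):
        decision = self._slow_classify(fingerprint)
        with self._lock:
            self._decisions[fingerprint] = decision
            self._pending.discard(fingerprint)

    def close(self):
        # Wait for in-flight analyses to finish, then stop the workers.
        self._pool.shutdown(wait=True)
```

The first request for a new fingerprint gets the provisional decision; once the background analysis finishes, later requests get the cached result without calling the slow path again.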
What gets caught (with examples)
Typical traffic distribution on a mid-sized publisher:
- Real humans: 60-70%. Allow through, no friction.
- Known good crawlers (Googlebot, Bingbot, PerplexityBot): 8-15%. Allow.
- Known AI training crawlers (GPTBot, ClaudeBot, etc.): 5-12%. Apply your licensing config.
- Unknown but well-behaved: 3-8%. Log, surface in dashboard for review, allow by default.
- Scrapers pretending to be browsers: 4-10%. Detect via TLS fingerprint, behaviour, header anomalies. Honeypot or block.
- Scrapers pretending to be AI crawlers: 1-5%. Detect via IP-range mismatch. Block.
- Genuinely malicious traffic (credential-stuffing attempts, vulnerability scans, etc.): 1-3%. Block, rate-limit, or honeypot depending on pattern.
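The breakdown above amounts to a policy table that maps a traffic class to an action. A rough sketch of such a table is shown here. The class labels, the `action_for` helper, and the shape of the `licensing` mapping are hypothetical names chosen for illustration, and where the list allows more than one action the sketch simply picks one.

```python
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    THROTTLE = "throttle"
    HONEYPOT = "honeypot"


# Hypothetical class labels, one per row of the traffic breakdown.
DEFAULT_POLICY: Dict[str, Action] = {
    "human": Action.ALLOW,
    "known_good_crawler": Action.ALLOW,
    "unknown_well_behaved": Action.ALLOW,     # also logged for review
    "scraper_fake_browser": Action.HONEYPOT,  # could equally be BLOCK
    "scraper_fake_ai_crawler": Action.BLOCK,
    "malicious": Action.BLOCK,                # or throttle/honeypot by pattern
}


def action_for(
    traffic_class: str,
    bot_name: Optional[str] = None,
    licensing: Optional[Dict[str, Action]] = None,
) -> Action:
    """Look up the action for a class; AI training crawlers use licensing."""
    if traffic_class == "ai_training_crawler":
        # Assumption: crawlers missing from the licensing config are blocked.
        return (licensing or {}).get(bot_name, Action.BLOCK)
    # Unrecognised labels raise KeyError rather than silently allowing.
    return DEFAULT_POLICY[traffic_class]
```

For example, `action_for("ai_training_crawler", "GPTBot", {"GPTBot": Action.THROTTLE})` returns `Action.THROTTLE`, while the same call with an empty licensing mapping falls back to `Action.BLOCK`.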
The classification stack
Sentinel chains four detection signals, each cheap individually:
- User-agent match against the curated database. First-pass classification.
- IP-range verification for claimed AI crawlers. Catches ~80% of UA spoofing.
- Behavioural fingerprint: request rate, header order, JS execution, cookie behaviour, TLS signature. Produces a confidence score.
- AI classifier for unknowns. Gemini-backed, trained on millions of bot patterns. Used only when the first three signals don’t produce high-confidence classification.
Each stage can short-circuit the stages after it. Most requests are classified by stage 1 or 2 and never reach the AI classifier, which is the only stage that incurs token cost.
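One way to picture the short-circuiting is a loop over stages ordered cheapest first, returning as soon as any stage is confident enough. The sketch below assumes a `Verdict` record, a 0.9 confidence threshold, and toy stage functions. None of these come from Sentinel itself, and the final stage is only a stand-in for the AI classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    label: str
    confidence: float
    stage: str


Stage = Callable[[dict], Optional[Verdict]]


def classify(request: dict, stages: List[Stage], threshold: float = 0.9) -> Verdict:
    """Run stages in order; stop at the first high-confidence verdict."""
    best = Verdict("unknown", 0.0, "none")
    for stage in stages:
        verdict = stage(request)
        if verdict is None:
            continue  # this stage has no opinion
        if verdict.confidence >= threshold:
            return verdict  # short-circuit: later stages never run
        if verdict.confidence > best.confidence:
            best = verdict
    return best


# Toy stages for illustration only.
KNOWN_AGENTS = {"ExampleMonitor": ("monitoring_service", 0.95)}  # hypothetical
ai_calls = []  # records how often the expensive stage is reached


def ua_stage(request: dict) -> Optional[Verdict]:
    ua = request.get("user_agent", "")
    for token, (label, confidence) in KNOWN_AGENTS.items():
        if token in ua:
            return Verdict(label, confidence, "user_agent")
    return None


def no_opinion(request: dict) -> Optional[Verdict]:
    return None  # placeholder for the IP-range and behavioural stages


def ai_stage(request: dict) -> Optional[Verdict]:
    ai_calls.append(request)  # stand-in for the costly AI classifier
    return Verdict("unknown_well_behaved", 0.6, "ai_classifier")


STAGES = [ua_stage, no_opinion, no_opinion, ai_stage]
```

With these toy stages, a request whose user-agent contains `ExampleMonitor` is settled by `ua_stage` and `ai_calls` stays empty. A request with an unrecognised user-agent falls through to `ai_stage`.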
What you see in the dashboard
- Live feed of incoming requests with classification, IP, UA, action taken.
- Per-bot rollups over the last 24h / 7d / 30d.
- Suggestions queue: bots the classifier wasn’t sure about — your decisions train the model.
- Per-IP history: search any IP, see every visit, decision, and pattern.
- Audit log: every rule change, who made it, when.
Real-world cases
Things Bot Sentinel has caught in customer deployments:
- A “GPTBot” coming from an AWS IP not in OpenAI’s published range → scraper. Honeypotted.
- A small AI startup’s crawler that was hitting the same article 500 times/minute (their developer had misconfigured retries) → rate-limited with a friendly 429.
- A competitor’s price-scraper rotating through residential IPs → caught via behavioural fingerprint. Served decoy pricing.
- A research lab’s experimental crawler with a never-before-seen user-agent → classified as “unknown but legitimate”, surfaced to the operator who allow-listed it.
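The first case above, a claimed “GPTBot” arriving from an address outside the published range, comes down to a membership test on IP networks. A hedged sketch using Python's standard `ipaddress` module is shown below. The networks in `PUBLISHED_RANGES` are documentation-only placeholder blocks, not any operator's real ranges, and `ExampleBot` and `verify_claimed_crawler` are hypothetical names.

```python
import ipaddress

# Placeholder networks from the reserved documentation blocks.
# A real deployment would load each operator's published ranges instead.
PUBLISHED_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
    "ExampleBot": [ipaddress.ip_network("198.51.100.0/24")],  # hypothetical bot
}


def verify_claimed_crawler(user_agent: str, remote_ip: str) -> str:
    """Return 'verified', 'spoofed', or 'no_claim' for a request."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for name, networks in PUBLISHED_RANGES.items():
        if name.lower() in ua:
            # The UA claims to be this crawler; check where it came from.
            if any(ip in network for network in networks):
                return "verified"
            return "spoofed"
    return "no_claim"  # UA does not claim to be a listed crawler
```

A request claiming `GPTBot` from `203.0.113.7` comes back as `"spoofed"` under these placeholder ranges, while the same user-agent from `192.0.2.10` comes back as `"verified"`.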