AI Bot Detection: Separating Real Crawlers from Scrapers

User-agent strings are unreliable. They were designed in an era when nobody had a reason to lie about who they were. In 2026, scrapers lie constantly — pretending to be GPTBot, PerplexityBot, or just plain Chrome — to bypass anti-scraping rules. If your bot management depends on user-agent matching alone, you’re blocking the honest crawlers and missing the dishonest ones.

Three layers of detection

Reliable bot identification combines three independent signals:

1. User-agent matching (necessary but not sufficient)

Start with a curated, frequently-updated database of known bot user-agent strings. Match incoming requests against it. This gives you a first-pass classification: “claims to be GPTBot”, “claims to be PerplexityBot”, “claims to be a browser”. Trust nothing yet.
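As a rough illustration of this first pass, the sketch below matches a user-agent string against a tiny hard-coded pattern table. The table, the function name claimed_identity, and the "unknown" fallback are assumptions made for the example; a real deployment would load a maintained database instead.

```python
import re

# Hypothetical, hand-picked patterns for illustration only; a real system
# would load a curated, frequently updated database.
KNOWN_BOT_PATTERNS = {
    "GPTBot": re.compile(r"\bGPTBot\b", re.IGNORECASE),
    "PerplexityBot": re.compile(r"\bPerplexityBot\b", re.IGNORECASE),
}

# Most browsers (and many bots) start their user-agent with this token,
# so it is only checked after the bot patterns.
BROWSER_PATTERN = re.compile(r"\bMozilla/5\.0\b")


def claimed_identity(user_agent: str) -> str:
    """Return what the request claims to be. This is a claim, not a verdict."""
    for name, pattern in KNOWN_BOT_PATTERNS.items():
        if pattern.search(user_agent):
            return f"claims to be {name}"
    if BROWSER_PATTERN.search(user_agent):
        return "claims to be a browser"
    return "unknown"
```

Bot patterns are checked before the browser pattern because crawler user-agents commonly embed a browser-style prefix as well.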

2. IP-range verification

Major AI vendors publish the IP ranges their crawlers operate from:

  • OpenAI: openai.com/gptbot.json (verified BGP ranges)
  • Anthropic: published in their support docs
  • Perplexity: published verification process
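  • Google-Extended: a robots.txt control token rather than a separate crawler; the requests come from Google’s regular crawlers, so verify them with the same DNS lookup as Googlebot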

If a request claims to be GPTBot but comes from an IP outside OpenAI’s published range, it’s spoofed. Drop it. This single check eliminates ~80% of UA spoofing.
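A minimal sketch of the range check, using Python's standard ipaddress module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not any vendor's real ranges, and PUBLISHED_RANGES and is_verified are names invented for this example; in practice the table would be refreshed from each vendor's published list.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# These are NOT real vendor ranges; load the vendor's published list instead.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/25"],
}


def is_verified(claimed_bot: str, remote_ip: str) -> bool:
    """True only if remote_ip falls inside a range published for claimed_bot."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        # A malformed address can never be verified.
        return False
    for cidr in PUBLISHED_RANGES.get(claimed_bot, []):
        if ip in ipaddress.ip_network(cidr):
            return True
    return False
```

A bot with no entry in the table is never verified by this function, which keeps the default on the cautious side.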

3. Behavioural fingerprinting

For everything else — including bots that don’t publish IP ranges, and the long tail of small AI tools — you look at how the request behaves:

  • Request rate (real browsers don’t hit 60 req/sec)
  • Header order (browsers send headers in predictable orders; scrapers often don’t)
  • Cookie behaviour (real browsers store and resend cookies; most scrapers don’t)
  • JS-execution probes (real browsers run JavaScript; most scrapers can’t)
  • TLS fingerprint (JA3) — browsers and HTTP libraries have distinct TLS signatures

No single signal is conclusive, but in combination they produce a confidence score. Bot Sentinel feeds the combination into a Gemini-backed classifier that’s been trained on millions of bot patterns.
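To make the idea of combining signals concrete, here is a deliberately simple hand-weighted score. It is not the classifier described above: the weights, the 10 req/sec threshold, and the names RequestSignals and bot_score are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    requests_per_second: float
    header_order_matches_browser: bool
    resends_cookies: bool
    passed_js_probe: bool
    tls_fingerprint_matches_browser: bool


# Assumed threshold; tune against your own traffic.
RATE_LIMIT_RPS = 10.0


def bot_score(s: RequestSignals) -> float:
    """Return a score in [0, 1]; higher means more bot-like.

    Weights are illustrative guesses that sum to 1.0.
    """
    score = 0.0
    if s.requests_per_second > RATE_LIMIT_RPS:
        score += 0.30
    if not s.header_order_matches_browser:
        score += 0.15
    if not s.resends_cookies:
        score += 0.15
    if not s.passed_js_probe:
        score += 0.20
    if not s.tls_fingerprint_matches_browser:
        score += 0.20
    return round(score, 2)
```

A fixed weighted sum like this is easy to reason about but brittle, which is one reason a trained classifier is the more common choice in production.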

What you do with the classification

Once you know what’s hitting your site, you act:

  • Verified known bot — apply the per-bot rule from your licensing config.
  • Unverified known bot (UA spoofed) — block or honeypot.
  • Unknown but well-behaved — log, surface in operator dashboard for human review.
  • Unknown and aggressive — rate-limit; auto-block if pattern persists.
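The four outcomes above can be sketched as a small decision function. The Verdict enum, the action strings, and the 0.6 threshold are hypothetical names and values for this example; the per-bot licensing rules themselves are outside its scope.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED_KNOWN = "verified known bot"
    SPOOFED_KNOWN = "unverified known bot"
    UNKNOWN_WELL_BEHAVED = "unknown but well-behaved"
    UNKNOWN_AGGRESSIVE = "unknown and aggressive"


# Assumed cut-off on a 0..1 behavioural score.
AGGRESSIVE_THRESHOLD = 0.6


def classify(claimed_bot: Optional[str], ip_verified: bool, bot_score: float) -> Verdict:
    """Combine the UA claim, the IP check, and the behavioural score."""
    if claimed_bot is not None:
        return Verdict.VERIFIED_KNOWN if ip_verified else Verdict.SPOOFED_KNOWN
    if bot_score >= AGGRESSIVE_THRESHOLD:
        return Verdict.UNKNOWN_AGGRESSIVE
    return Verdict.UNKNOWN_WELL_BEHAVED


def action_for(verdict: Verdict, pattern_persists: bool = False) -> str:
    """Map a verdict to the action described in the list above."""
    if verdict is Verdict.VERIFIED_KNOWN:
        return "apply_per_bot_rule"
    if verdict is Verdict.SPOOFED_KNOWN:
        return "block_or_honeypot"
    if verdict is Verdict.UNKNOWN_WELL_BEHAVED:
        return "log_for_review"
    # Unknown and aggressive: escalate only if the behaviour continues.
    return "auto_block" if pattern_persists else "rate_limit"
```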

The honeypot pattern

For known abusers (a competitor’s price scraper, a known-bad IP range), serving plausible-but-fake data is more useful than blocking outright. Blocking tells the scraper they’re caught and they’ll find a way around it. A honeypot lets them keep scraping garbage data they think is real, polluting their downstream system. AIOX has a one-click honeypot mode you can apply per rule.
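One way to picture the pattern, using prices as the example: the sketch below returns a skewed but stable decoy price for honeypotted requests. It is an assumption-laden illustration, not how any particular product implements honeypot mode; decoy_price, price_for, and the 15% skew are invented for the example.

```python
import hashlib


def decoy_price(product_id: str, real_price: float, max_skew: float = 0.15) -> float:
    """Return a plausible fake price within +/- max_skew of the real one.

    The skew is derived from a hash of the product ID, so the same product
    always gets the same decoy and repeated scrapes look consistent.
    """
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    unit = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    skew = (unit * 2 - 1) * max_skew                  # value in [-max_skew, max_skew)
    return round(real_price * (1 + skew), 2)


def price_for(product_id: str, real_price: float, honeypot: bool) -> float:
    """Serve the real price normally and the decoy to honeypotted requests."""
    return decoy_price(product_id, real_price) if honeypot else real_price
```

Keeping the decoy deterministic matters: values that change on every request are an easy tell that the data is fake.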

What you don’t want to do

  • Block every bot that identifies as an AI crawler. You’ll also block PerplexityBot, which sends users back to you.
  • Rely on robots.txt as enforcement — it’s advisory only. Malicious bots ignore it.
  • Build your own classifier from scratch — bot patterns shift fast; you’ll fall behind unless this is your full-time job.