User-agent strings are unreliable. They were designed in an era when nobody had a reason to lie about who they were. In 2026, scrapers lie constantly — pretending to be GPTBot, PerplexityBot, or just plain Chrome — to bypass anti-scraping rules. If your bot management depends on user-agent matching alone, you’re blocking the honest crawlers and missing the dishonest ones.
Reliable bot identification combines three independent signals:
Start with a curated, frequently-updated database of known bot user-agent strings. Match incoming requests against it. This gives you a first-pass classification: “claims to be GPTBot”, “claims to be PerplexityBot”, “claims to be a browser”. Trust nothing yet.
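As a rough illustration of this first pass, here is a minimal Python sketch. The pattern table and the `claimed_identity` function are hypothetical stand-ins for a real, curated database; the point is only that the output is a claim, not a verdict.

```python
import re

# Hypothetical, hand-written patterns for illustration only. A real setup
# would load a curated, frequently updated database instead.
KNOWN_BOT_PATTERNS = {
    "GPTBot": re.compile(r"GPTBot", re.IGNORECASE),
    "PerplexityBot": re.compile(r"PerplexityBot", re.IGNORECASE),
}
BROWSER_PATTERN = re.compile(r"Mozilla/5\.0 .*(Chrome|Firefox|Safari)/", re.IGNORECASE)


def claimed_identity(user_agent: str) -> str:
    """Return what the request claims to be. Nothing is trusted at this stage."""
    # Check bot names first: bot user agents often start with "Mozilla/5.0" too.
    for name, pattern in KNOWN_BOT_PATTERNS.items():
        if pattern.search(user_agent):
            return name
    if BROWSER_PATTERN.search(user_agent):
        return "browser"
    return "unknown"
```

Calling `claimed_identity(request_user_agent)` gives you a label such as "GPTBot" or "browser" that the later checks can confirm or reject.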
Major AI vendors publish the IP ranges their crawlers operate from:
openai.com/gptbot.json (verified BGP ranges)
If a request claims to be GPTBot but comes from an IP outside OpenAI’s published range, it’s spoofed. Drop it. This single check eliminates ~80% of UA spoofing.
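The range check itself is simple. The sketch below uses Python's standard `ipaddress` module and assumes the published ranges have already been downloaded and parsed into CIDR strings. The `PUBLISHED_RANGES` table and the `is_verified` function are hypothetical, and the CIDRs are reserved documentation ranges, not real crawler ranges.

```python
import ipaddress

# Placeholder CIDRs taken from the reserved documentation ranges.
# In practice you would load these from the vendor's published list
# and refresh them on a schedule.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified(claimed: str, remote_ip: str) -> bool:
    """True only if remote_ip falls inside a range published for the claimed bot."""
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # unparseable address: treat as unverified
    networks = [ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES.get(claimed, [])]
    # Membership is False when the IP versions differ, so mixed v4/v6 lists are safe.
    return any(addr in network for network in networks)
```

A request whose user agent says GPTBot but for which `is_verified("GPTBot", remote_ip)` returns `False` is the spoofing case described above.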
For everything else — including bots that don’t publish IP ranges, and the long tail of small AI tools — you look at how the request behaves:
None of these are individually conclusive, but in combination they produce a confidence score. Bot Sentinel feeds the combination into a Gemini-backed classifier that’s been trained on millions of bot patterns.
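To make "individually weak, jointly strong" concrete, here is a small Python sketch that combines signals with a noisy-OR rule. This is not how Bot Sentinel's classifier works; the signal names, weights, and the `bot_confidence` function are all invented for illustration.

```python
from math import prod
from typing import Iterable

# Hypothetical signals and weights, chosen only to illustrate the idea.
# Each weight is the confidence that signal would justify on its own.
SIGNAL_WEIGHTS = {
    "no_assets_fetched": 0.4,
    "high_request_rate": 0.5,
    "ignores_robots_txt": 0.3,
}


def bot_confidence(observed: Iterable[str]) -> float:
    """Noisy-OR combination: every observed signal pushes the score toward 1."""
    fired = set(observed) & SIGNAL_WEIGHTS.keys()  # ignore unknown names and duplicates
    return 1.0 - prod(1.0 - SIGNAL_WEIGHTS[name] for name in fired)
```

With these made-up weights, no single signal scores above 0.5, but all three together score about 0.79. A trained classifier replaces the hand-picked weights with learned ones, but the intuition is the same.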
Once you know what’s hitting your site, you act:
For known abusers (a competitor’s price scraper, a known-bad IP range), serving plausible-but-fake data is more useful than blocking outright. Blocking tells scrapers they’ve been caught, and they’ll find a way around it. A honeypot lets them keep scraping garbage data they think is real, polluting their downstream systems. AIOX has a one-click honeypot mode you can apply per rule.
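If you want a feel for what "plausible-but-fake" can mean in practice, here is a minimal Python sketch for the price-scraper case. It is not AIOX's implementation; `honeypot_price`, `price_for`, and the `secret` parameter are hypothetical. The fake price is derived from a hash, so a flagged scraper sees the same wrong number on every visit, which makes it harder to notice.

```python
import hashlib


def honeypot_price(real_price: float, product_id: str, secret: str = "rotate-me") -> float:
    """Return a wrong but believable price that is stable per product."""
    digest = hashlib.sha256(f"{secret}:{product_id}".encode()).digest()
    # First two bytes pick a skew magnitude between 5% and 15%.
    fraction = int.from_bytes(digest[:2], "big") / 65535
    magnitude = 0.05 + fraction * 0.10
    # Lowest bit of the third byte picks the direction of the skew.
    sign = 1 if digest[2] & 1 else -1
    return round(real_price * (1 + sign * magnitude), 2)


def price_for(request_is_flagged: bool, real_price: float, product_id: str) -> float:
    """Serve the real price to normal traffic and the decoy to flagged scrapers."""
    if request_is_flagged:
        return honeypot_price(real_price, product_id)
    return real_price
```

Keeping the skew within a modest band is a design choice: numbers that are wildly off are easy for the scraper's operator to spot and filter out.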