Most websites have no clear idea which AI crawlers are visiting them, how often, or what they’re doing once they arrive. That’s a problem when you’re trying to make decisions about licensing, blocking, or optimisation. Let’s fix it.
The major AI crawlers (and what they do)
- GPTBot (OpenAI) — the training crawler for ChatGPT. Crawls the open web to build / refresh the model’s training corpus. User-agent: GPTBot/1.0. Honours robots.txt.
- OAI-SearchBot (OpenAI) — separate crawler for ChatGPT Search. This one builds the live retrieval index, not training data. Different opt-out from GPTBot.
- ChatGPT-User (OpenAI) — fires when a ChatGPT user explicitly asks the model to fetch a URL. Honours robots.txt for the specific bot string.
- ClaudeBot (Anthropic) — Anthropic’s general training / retrieval crawler. Honours robots.txt.
- anthropic-ai (Anthropic, legacy) — older user-agent, still seen in some traffic. Same opt-out treatment.
- PerplexityBot (Perplexity) — citation-focused. Perplexity sends users back to you when it cites your content, so most publishers welcome this one.
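- Google-Extended (Google) — not a separate crawler but a robots.txt control token for AI-training use. Google’s existing crawlers do the fetching, so Google-Extended never appears as a user-agent in your logs. You can allow Googlebot (for search) and disallow Google-Extended (for AI training).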
- Bytespider (ByteDance) — TikTok / ByteDance training crawler. Aggressive crawl rate; many publishers block this one.
- CCBot (Common Crawl) — the open dataset used downstream by many AI labs. Blocking CCBot effectively reduces your inclusion in dozens of derivative training sets.
- FacebookBot / Meta-ExternalAgent — Meta’s training crawlers.
Two flavours: training vs. retrieval
AI crawlers fall into two categories that warrant different responses:
- Training crawlers ingest content into model training. They don’t directly send users back; they improve future models, which may or may not surface your content months later. Examples: GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot.
- Retrieval crawlers fetch content live when a user asks a question. They cite you with a link and drive direct, attributable traffic. Examples: OAI-SearchBot, ChatGPT-User, PerplexityBot.
A common policy split: block training crawlers (or require licensing), allow retrieval crawlers (because they send users back).
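As a rough illustration of that split, the sketch below renders a robots.txt from a per-bot policy table and checks the result with Python's standard-library parser. The `POLICY` mapping and the `render_robots_txt` and `can_fetch` helpers are hypothetical names invented for this example, and the allow/block choices show only one possible policy, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy table: robots.txt user-agent token -> "allow" or "block".
POLICY = {
    "GPTBot": "block",
    "ClaudeBot": "block",
    "Google-Extended": "block",
    "Bytespider": "block",
    "CCBot": "block",
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "PerplexityBot": "allow",
}


def render_robots_txt(policy):
    """Render one robots.txt group per bot, plus a default '*' group."""
    groups = []
    for token, decision in policy.items():
        rule = "Disallow: /" if decision == "block" else "Allow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    # Default group: everything not listed above stays allowed.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


def can_fetch(robots_txt, user_agent, url="https://example.com/article"):
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(render_robots_txt(POLICY))
```

Keeping the policy in one table means a change of mind about a single bot is a one-line edit. Note that robots.txt only expresses a request; it does nothing against clients that ignore it.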
Spoofed user-agents are everywhere
The user-agent string is just a header — anyone can set it to whatever. Scrapers routinely impersonate AI crawlers, especially GPTBot and PerplexityBot, to bypass anti-scraping rules. Reliable detection requires more than UA matching:
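- Verify the request comes from the bot’s published IP range where the operator provides one. OpenAI and Perplexity publish machine-readable lists; check each operator’s documentation for current ranges, since not every vendor offers one.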
- Check for behavioural fingerprints — request rate, header order, JS-execution patterns.
- For unknown bots, use a behavioural classifier (which is what AI Bot Sentinel uses).
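The IP-range check can be sketched in a few lines with Python's `ipaddress` module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not real crawler addresses, and `PUBLISHED_RANGES`, `claimed_bot`, and `classify` are hypothetical names for this example. In practice the ranges would be loaded from each operator's published list.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# Real values must come from each operator's published list.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent):
    """Return the tracked bot token found in the UA string, or None."""
    ua = user_agent.lower()
    for token in PUBLISHED_RANGES:
        if token.lower() in ua:
            return token
    return None


def classify(user_agent, remote_ip):
    """Return 'verified', 'spoofed', or 'unknown' for a single request."""
    token = claimed_bot(user_agent)
    if token is None:
        return "unknown"  # not claiming to be a bot we have ranges for
    addr = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES[token]:
        if addr in ipaddress.ip_network(cidr):
            return "verified"
    return "spoofed"  # claims the name but arrives from outside the ranges
```

A request that comes back as "spoofed" is a good candidate for the same treatment you give any other scraper, whatever its user-agent says. Requests that come back as "unknown" are where behavioural signals take over.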
What you should do this week
- Audit your access logs — grep for the user-agents above. Count visits per bot over the last 30 days.
- Decide your policy per bot — allow, allow-with-citation-required, license-required, block.
- Encode it everywhere — robots.txt, TDM-REP headers, AIOX Capsule license fields. (AIOX does all three from a single config.)
- Monitor — actual traffic per bot, anomalies, scraper impersonation attempts.
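For the audit step, a minimal sketch of counting hits per bot might look like the following. It assumes access logs in the common "combined" format, where the user-agent is the last double-quoted field. `AI_BOT_TOKENS`, `UA_FIELD`, and `count_bot_hits` are hypothetical names for this example, and restricting the count to the last 30 days is left out for brevity.

```python
import re
from collections import Counter

# User-agent substrings to look for (matched case-insensitively).
# Google-Extended is left out: it is a robots.txt token, not a request user-agent.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Bytespider", "CCBot", "FacebookBot", "Meta-ExternalAgent",
]

# Assumes the combined log format: the user-agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(log_lines):
    """Count requests per AI bot token across an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                counts[token] += 1
                break  # count each request at most once
    return counts
```

A typical use would be `count_bot_hits(open("access.log"))`, where the file path is whatever your server writes to. These counts reflect claimed user-agents only, so they include any impersonators until the IP check is applied.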