How AI Crawlers Actually Work: GPTBot, ClaudeBot, PerplexityBot Explained

Most websites have no clear idea which AI crawlers are visiting them, how often, or what they’re doing once they arrive. That’s a problem when you’re trying to make decisions about licensing, blocking, or optimisation. Let’s fix it.

The major AI crawlers (and what they do)

  • GPTBot (OpenAI): the training crawler behind ChatGPT’s models. It crawls the open web to build and refresh the training corpus. User-agent token: GPTBot (the full string also carries a version number, such as GPTBot/1.0, which can change). Honours robots.txt.
  • OAI-SearchBot (OpenAI): a separate crawler for ChatGPT Search. It builds the live retrieval index, not training data, and has its own robots.txt token, so opting out of GPTBot does not opt you out of this one.
  • ChatGPT-User (OpenAI): fires when a ChatGPT user explicitly asks the model to fetch a URL. It is addressed in robots.txt by its own token, ChatGPT-User, not by rules written for GPTBot.
  • ClaudeBot (Anthropic): Anthropic’s general training / retrieval crawler. Honours robots.txt.
  • anthropic-ai (Anthropic, legacy): an older user-agent still seen in some traffic. Same opt-out treatment as ClaudeBot.
  • PerplexityBot (Perplexity): citation-focused. Perplexity links back to you when it cites your content, so many publishers allow this one.
  • Google-Extended (Google): Google’s separate control for AI training. It is a robots.txt token rather than a crawler with its own user-agent string: the fetching is still done by Google’s existing crawlers, so it will not show up in your access logs. You can allow Googlebot (for search) and disallow Google-Extended (for AI training).
  • Bytespider (ByteDance): TikTok / ByteDance training crawler. Aggressive crawl rate; many publishers block this one.
  • CCBot (Common Crawl): the crawler behind the open dataset used downstream by many AI labs. Blocking CCBot keeps your pages out of future Common Crawl snapshots, and so reduces your inclusion in the training sets derived from them.
  • FacebookBot / Meta-ExternalAgent (Meta): Meta’s training crawlers.

Two flavours: training vs. retrieval

AI crawlers fall into two categories that warrant different responses:

  • Training crawlers ingest content into model training corpora. They don’t send users back directly; they improve future models, which may or may not surface your content months later. Examples: GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot.
  • Retrieval crawlers fetch content live when a user asks a question. They cite you with a link, so they drive direct, attributable traffic. Examples: OAI-SearchBot, ChatGPT-User, PerplexityBot.

A common policy split: block training crawlers (or require licensing), allow retrieval crawlers (because they send users back).
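
As a rough illustration of that split, here is a minimal Python sketch that writes a robots.txt blocking the training tokens and allowing the retrieval ones, then checks the result with the standard library's urllib.robotparser. The build_robots_txt helper and the example.com URLs are made up for this example, and robots.txt only binds crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the lists above; the grouping mirrors the training/retrieval split.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider", "CCBot"]
RETRIEVAL = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def build_robots_txt(blocked, allowed):
    """Hypothetical helper: return robots.txt text for the given policy."""
    lines = []
    for token in blocked:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allowed:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    # Everything not named above (regular search crawlers, browsers) stays allowed.
    lines += ["User-agent: *", "Allow: /", ""]
    return "\n".join(lines)


robots_txt = build_robots_txt(TRAINING, RETRIEVAL)

# Sanity-check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/article"))      # True
```

Swapping a token between the two lists is all it takes to change the policy for that bot, which keeps the decision and its encoding in one place.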

Spoofed user-agents are everywhere

The user-agent string is just a header — anyone can set it to whatever. Scrapers routinely impersonate AI crawlers, especially GPTBot and PerplexityBot, to bypass anti-scraping rules. Reliable detection requires more than UA matching:

  • Verify the request comes from the bot’s published IP range, where the operator publishes one (OpenAI and Perplexity do; check each vendor’s current documentation, since not every operator publishes ranges).
  • Check for behavioural fingerprints: request rate, header order, JS-execution patterns.
  • For unknown bots, use a behavioural classifier (which is the approach AI Bot Sentinel uses).
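
The IP-range check can be sketched in a few lines of Python with the standard ipaddress module. The ranges below are placeholders from the reserved documentation blocks, not real crawler ranges; in practice you would load each operator's published list. The helper names claimed_bot and is_verified are invented for this example.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), NOT real crawler ranges.
# Replace with the lists each operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent):
    """Return the first known bot token found in the User-Agent header, else None."""
    ua = user_agent.lower()
    for token in PUBLISHED_RANGES:
        if token.lower() in ua:
            return token
    return None


def is_verified(user_agent, remote_ip):
    """True only if the UA claims a known bot AND the IP falls inside that bot's ranges."""
    token = claimed_bot(user_agent)
    if token is None:
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[token])


ua = "Mozilla/5.0 (compatible; GPTBot/1.0)"
print(is_verified(ua, "192.0.2.15"))   # True: claimed bot, IP inside its range
print(is_verified(ua, "203.0.113.9"))  # False: same UA from elsewhere, likely spoofed
```

A request that claims a bot's user-agent but fails this check is a good candidate for the behavioural checks rather than an automatic allow.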

What you should do this week

  1. Audit your access logs — grep for the user-agents above. Count visits per bot over the last 30 days.
  2. Decide your policy per bot — allow, allow-with-citation-required, license-required, block.
  3. Encode it everywhere — robots.txt, TDM-REP headers, AIOX Capsule license fields. (AIOX does all three from a single config.)
  4. Monitor — actual traffic per bot, anomalies, scraper impersonation attempts.
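
For step 1, a log audit might look like the sketch below. It assumes the common "combined" access-log format with timezone-aware timestamps and English month abbreviations; the count_bot_visits function and the sample lines are invented for illustration. Google-Extended is left out because it is a robots.txt token and does not appear as a user-agent.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
              "PerplexityBot", "Bytespider", "CCBot", "FacebookBot", "Meta-ExternalAgent"]

# Combined log format: ip - - [timestamp] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')


def count_bot_visits(lines, now, days=30):
    """Count requests per bot token among log lines from the last `days` days."""
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if ts < cutoff:
            continue
        ua = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token.lower() in ua:
                counts[token] += 1
                break
    return counts


# Made-up sample lines; in practice, iterate over your real access log file.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [11/Oct/2024:08:01:02 +0000] "GET /b HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [01/Jan/2024:00:00:00 +0000] "GET /c HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
now = datetime(2024, 10, 15, tzinfo=timezone.utc)
counts = count_bot_visits(sample, now)
print(counts["GPTBot"], counts["ClaudeBot"])  # 1 1 (the January line is older than 30 days)
```

These counts are based on the user-agent header alone, so they include impersonators; treat them as an upper bound until the IP check has been applied.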