
AI Crawler Readiness: How to Ensure GPTBot, ClaudeBot & PerplexityBot Can Access Your Site

Five AI crawlers power the major AI assistants — and most websites are accidentally blocking at least one of them. This guide covers every AI crawler user-agent, how to configure access, and how to test your configuration.

By Kyle Fairburn, Founder & AI Specialist at NexRank

The Five AI Crawlers You Need to Know

Every major AI assistant uses one or more crawlers to index web content. Understanding which crawler powers which AI — and how to ensure they can access your site — is a prerequisite for AI visibility.

GPTBot (OpenAI / ChatGPT)

  • User-agent: GPTBot
  • Powers: ChatGPT browsing mode, Bing AI (via partnership), future model training
  • Crawl behaviour: Moderate frequency, respects robots.txt and Crawl-delay
  • IP ranges: Published at openai.com/gptbot

GPTBot was launched in August 2023 and is one of the most widely encountered AI crawlers. It powers ChatGPT's ability to cite current web sources in responses. Blocking GPTBot means ChatGPT cannot retrieve live information about your business.
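An explicit robots.txt group leaves no ambiguity about GPTBot access. A minimal sketch (the Crawl-delay directive is optional, and the value shown is illustrative):

User-agent: GPTBot
Allow: /
Crawl-delay: 5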

ClaudeBot (Anthropic / Claude)

  • User-agent: ClaudeBot
  • Powers: Anthropic Claude's content retrieval system
  • Crawl behaviour: Conservative, privacy-aware, avoids personal data
  • Notable: Explicitly designed not to crawl pages containing personal user data

ClaudeBot respects robots.txt strictly and maintains a lower crawl rate than most crawlers. Anthropic publishes its crawler documentation and IP ranges for verification.

PerplexityBot (Perplexity AI)

  • User-agent: PerplexityBot
  • Powers: Perplexity AI real-time search and answer generation
  • Crawl behaviour: High frequency — Perplexity emphasises real-time accuracy
  • Why it matters: Perplexity always cites sources inline, making it one of the highest-visibility AI citation channels

PerplexityBot is particularly valuable to optimise for because every Perplexity answer that includes your content shows your URL and domain name to users. There is no "invisible citation" — your brand is explicitly displayed.

Google-Extended (Google / Gemini & AI Overviews)

  • User-agent: Google-Extended
  • Powers: Google Gemini, Google AI Overviews (in search results)
  • Crawl behaviour: No separate fetcher; Google-Extended is a robots.txt control token, and fetching is handled by Google's existing crawler infrastructure
  • Key distinction: You can block Google-Extended without affecting regular Google rankings

Google-Extended is how Google builds the training data and retrieval content for its AI products. Because AI Overviews now appear in a large proportion of Google searches, blocking Google-Extended has significant commercial consequences.
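Conversely, if a site did want to keep its content out of Gemini and AI Overviews while preserving its search rankings, the opt-out is an ordinary robots.txt group (shown for illustration; for most businesses this trade-off is not worth it):

User-agent: Google-Extended
Disallow: /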

Bingbot (Microsoft / Copilot)

  • User-agent: Bingbot
  • Powers: Both Bing Search and Microsoft Copilot
  • Crawl behaviour: Frequent, comprehensive
  • Note: The same crawler powers both traditional Bing rankings and AI-generated Copilot answers

Unlike Google, Microsoft does not separate its traditional search crawler from its AI crawler. Allowing Bingbot therefore enables both Bing ranking and Copilot citation at once.

The Most Expensive robots.txt Mistake

The most common misconfiguration, which blocks all AI crawlers at once, looks like this:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

This is extremely common on sites that want to exclude themselves from all search engines except Google. The problem: it blocks every AI crawler without exception. A site with this robots.txt is completely invisible to ChatGPT, Claude, Perplexity, and every other AI platform.
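You can verify this behaviour locally with Python's standard-library robots.txt parser (a reasonable approximation of how compliant crawlers read the file, though it does not replicate every crawler's exact matching rules):

```python
from urllib import robotparser

# The restrictive robots.txt from above: block everyone, allow only Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own group, so it is allowed; every AI crawler
# falls through to the wildcard group and is blocked.
for bot in ("Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot"):
    status = "allowed" if parser.can_fetch(bot, "https://example.com/") else "blocked"
    print(f"{bot}: {status}")
```

Running this shows Googlebot allowed and every AI user-agent blocked, exactly the failure mode described above.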

Another common mistake:

User-agent: *
Disallow: /wp-admin/

This looks harmless: it only blocks the admin directory. But many sites carry additional directives that accidentally restrict AI bots further. Always check your full robots.txt for any rules that could affect the AI user-agents listed above.
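One quick way to run that check is to test every AI user-agent against your robots.txt programmatically. This sketch uses Python's standard-library parser; the crawler list matches the five covered above, and the URL is a placeholder for any page you care about:

```python
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bingbot"]

def audit_robots_txt(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each AI crawler user-agent to whether it may fetch `url`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_CRAWLERS}

# The "harmless" example above: only the admin directory is off limits.
report = audit_robots_txt("User-agent: *\nDisallow: /wp-admin/\n")
for bot, allowed in report.items():
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

Paste in your real robots.txt body and point the URL at your key pages to see exactly which crawlers are shut out.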

Getting robots.txt Right for AI Access

A correctly configured robots.txt for AI visibility explicitly allows each AI crawler by name — not just as part of a wildcard default — so access is unambiguous even when other rules are restrictive. The key is ensuring that each of the five major AI crawlers has clear permission to access your content.
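Sketched as a starting point rather than a drop-in file (merge it with your existing rules instead of replacing them), an explicit allow-list for the five crawlers looks like this:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bingbot
Allow: /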

The GEO scan checks your actual robots.txt against all five AI crawler user-agents and shows you exactly which are allowed, which are blocked, and what change would fix it. This is one of the most common critical issues we find — and one of the fastest to resolve once identified.

AI Training vs AI Browsing: An Important Distinction

Some website owners want to prevent AI companies from using their content for model training, while still allowing AI assistants to cite their live content in responses. This is a valid distinction.

  • Training: The process of using your content to build the model's internal knowledge (static, affects all future responses)
  • Browsing/Retrieval: Real-time crawling to retrieve current content for a specific user query (dynamic, cites you directly)

In practice, the platforms expose much of this distinction through robots.txt itself, using separate tokens: Google-Extended controls the use of your content for Gemini and AI Overviews without touching Googlebot or your search rankings, and OpenAI likewise documents which of its user agents are used for training versus live retrieval. Blocking a platform's only published user-agent, however, opts you out of both training and live browsing at once, which is often not the intention.

If you want to allow browsing but restrict training, check each platform's crawler documentation for its current user agents; where a platform publishes only a single agent, robots.txt cannot make the distinction, and you would need to contact the platform directly.

What the GEO Scan Checks

AI crawler readiness is binary — you are either accessible or you are not. Getting this right is the prerequisite for all other GEO work. There is no amount of content optimisation that compensates for a crawler that cannot reach your pages.

Your GEO scan automatically tests your robots.txt against all five AI crawler user-agents, checks your pages for server-side rendering, and verifies your sitemap. You get a clear pass/fail result for each crawler with the exact change needed to fix any issues. Run your free scan to see which AI bots can access your site today — and which cannot.

Check your GEO score for free

See how your website scores across all 8 GEO categories. Takes 60 seconds.

Get your free GEO score →