
How GPTBot, ClaudeBot & Perplexity Crawl Your Website

AI assistants send their own crawlers to index your content. Understanding how GPTBot, ClaudeBot, PerplexityBot, and Google-Extended work is the first step to ensuring they can find and cite your site.

By Kyle Fairburn, Founder & AI Specialist at NexRank

The Rise of AI Crawlers

In 2023–2024, every major AI platform launched its own web crawler. These bots don't rank pages for blue-link search results — they ingest content to power AI assistant responses, train future models, or both.

Understanding how each crawler works is essential for GEO. A site that accidentally blocks GPTBot is invisible to ChatGPT's browsing mode. A site that blocks all bots with a wildcard rule misses every AI platform simultaneously.

The Main AI Crawlers

GPTBot (OpenAI)

  • User-agent: GPTBot
  • IP ranges: Published at [openai.com/gptbot](https://openai.com/gptbot)
  • Purpose: Collects page content for model training (OpenAI documents separate agents, such as ChatGPT-User, for live browsing)
  • Crawl rate: Moderate — respects Crawl-delay directives
  • What it reads: HTML content, structured data, meta tags, Open Graph

GPTBot was introduced in August 2023 and gathers content for model training; OpenAI documents separate user-agents (ChatGPT-User, OAI-SearchBot) for user-triggered browsing and search. OpenAI publishes GPTBot's IP ranges and provides opt-out documentation.

ClaudeBot (Anthropic)

  • User-agent: ClaudeBot
  • Purpose: Content indexing for Claude's retrieval system
  • Crawl rate: Conservative — lower frequency than GPTBot
  • What it reads: Main content, headings, structured data

Anthropic's ClaudeBot follows robots.txt rules and respects crawl delays. It also honours noindex signals and is designed to avoid pages containing personal data.

PerplexityBot

  • User-agent: PerplexityBot
  • Purpose: Real-time web retrieval for Perplexity AI answers
  • Crawl rate: High frequency — Perplexity cites live sources
  • What it reads: Everything — Perplexity prioritises recency

PerplexityBot is particularly important because Perplexity cites its sources inline in answers. Being indexed by PerplexityBot translates directly into in-answer citations visible to users.

Google-Extended

  • User-agent: Google-Extended
  • Purpose: Controls whether content is used to train Gemini (formerly Bard) and to ground Google's AI products
  • Crawl rate: N/A (Google-Extended is a robots.txt control token; Googlebot does the actual fetching)
  • What it reads: All content that Googlebot indexes

Google-Extended is separate from Googlebot: it is not a crawler in its own right, but a robots.txt token that tells Google whether content Googlebot has already fetched may be used for Gemini training and grounding. You can block Google-Extended without affecting your traditional Google rankings, and per Google's documentation it does not control inclusion in AI Overviews, which follow normal Search indexing.

Bingbot (Microsoft Copilot)

  • User-agent: Bingbot
  • Purpose: Powers both Bing Search and Microsoft Copilot
  • Crawl rate: High
  • Note: The same Bingbot powers both traditional Bing results and AI-generated Copilot answers

How AI Crawlers Differ from Googlebot

Traditional search crawlers (Googlebot) prioritise:

  • Page rank signals (backlinks, authority)
  • Keyword relevance
  • Page speed and Core Web Vitals

AI crawlers prioritise:

  • Structured data — JSON-LD schemas that explicitly declare what a page is about
  • Content extractability — Clean HTML that can be parsed without JavaScript execution
  • Factual density — Pages with statistics, definitions, and citable claims
  • Freshness — Recently updated content for RAG-based systems like Perplexity
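The first of those priorities is easy to make concrete: a JSON-LD block in a page's head declares the page type and key facts in a form crawlers can parse without heuristics. A minimal sketch for an article page like this one (the dates are placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How GPTBot, ClaudeBot & Perplexity Crawl Your Website",
  "author": { "@type": "Person", "name": "Kyle Fairburn" },
  "datePublished": "2025-03-01",
  "dateModified": "2025-03-01"
}
```

Serve it in a script tag with type="application/ld+json"; the dateModified field also feeds the freshness signal mentioned above.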

robots.txt Configuration for AI Crawlers

The problem: wildcard blocks

Many sites use a wildcard disallow to block all bots except Googlebot:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

This configuration blocks every AI crawler. If your robots.txt looks like this, you are invisible to ChatGPT, Perplexity, Claude, and every other AI platform.

The recommended configuration

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Explicitly listing each AI bot ensures they have clear permission — even if your default rules are restrictive.

Pages to block from AI crawlers

Some content should not be indexed by AI systems:

  • Admin and authentication pages: Disallow: /admin/
  • User-generated private content: Disallow: /dashboard/
  • Duplicate or thin pages: Disallow: /tag/, Disallow: /author/
  • API endpoints: Disallow: /api/

llms.txt: The AI-Specific robots.txt

In 2024, a proposed standard emerged: llms.txt. Placed at yourdomain.com/llms.txt, it provides AI systems with a structured overview of your site — similar to what robots.txt does for crawl permissions, but for content discovery.

A well-structured llms.txt includes:

  • Company name and description
  • What the site covers
  • Key pages with descriptions
  • Content that AI assistants may cite
  • Contact and verification information
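The file is conventionally markdown: an H1 with the site name, a blockquote summary, then link lists with short descriptions. A minimal sketch (the company and URLs are hypothetical):

```markdown
# Example Co

> Example Co makes invoicing software for freelancers. This site covers product
> documentation, pricing, and a blog on small-business finance.

## Key pages

- [Product overview](https://example.com/product): What the software does and who it is for
- [Pricing](https://example.com/pricing): Current plans and billing FAQ
- [Docs](https://example.com/docs): Setup guides and API reference

## Citable content

- [Blog](https://example.com/blog): Original guides and statistics on freelancer finances

## Contact

- [About](https://example.com/about): Company details, press contact, and verification info
```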

Anthropic, Cloudflare, and a growing number of developer platforms publish llms.txt files. Crawler support is still uneven, but a site without one forgoes a cheap, direct channel for describing its content to AI systems.

JavaScript Rendering: A Common Crawl Blocker

Many AI crawlers — particularly older versions — cannot execute JavaScript. If your content is rendered client-side (React, Vue, Angular SPAs), AI crawlers may see empty pages.

Test this: Use curl -A "GPTBot" https://yourdomain.com to see what a raw HTTP request returns. If the response is minimal HTML with no content, AI crawlers cannot read your pages.

Fix: Implement Server-Side Rendering (SSR) or Static Site Generation (SSG). Next.js, Nuxt, and SvelteKit all support this. For existing SPAs, consider adding a server-rendered sitemap and key landing pages.
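A rough way to automate the curl check above: extract the visible text from the raw HTML and see whether there is any. The heuristic below (a sketch, not a crawler simulation) counts text outside script and style tags; a near-zero count on a content page usually means client-side rendering:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style> tags."""
    def __init__(self):
        super().__init__()
        self.in_skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_skip:
            self.in_skip -= 1

    def handle_data(self, data):
        if not self.in_skip and data.strip():
            self.chunks.append(data.strip())

def visible_text_length(html: str) -> int:
    """Length of the text a non-JavaScript crawler would see in this HTML."""
    parser = TextExtractor()
    parser.feed(html)
    return sum(len(chunk) for chunk in parser.chunks)

# A client-side-rendered shell: almost nothing outside the <script> tag
SPA_SHELL = '<html><body><div id="root"></div><script>/* 200KB bundle */</script></body></html>'
# A server-rendered page: the article text is in the HTML itself
SSR_PAGE = '<html><body><h1>How AI crawlers work</h1><p>GPTBot fetches raw HTML...</p></body></html>'
```

Run it against the HTML returned by a plain curl fetch of your key pages; a content page whose count is close to zero is effectively blank to a non-JavaScript crawler.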

Crawl Budget and Frequency

AI crawlers visit sites at different frequencies:

  • Perplexity: Most frequent — real-time retrieval means continuous crawling
  • GPTBot: Moderate — periodic updates to its knowledge base
  • ClaudeBot: Conservative — less frequent than Google or OpenAI
  • Google-Extended: Follows Googlebot's crawl schedule, since it is a usage token rather than a separate crawler

To maximise crawl coverage:

  1. Submit your sitemap.xml via Google Search Console (Google-Extended benefits automatically)
  2. Use IndexNow to notify Bing (and Bingbot/Copilot) of new content instantly
  3. Ensure fast page loads — crawlers abandon slow pages and return less frequently
  4. Use internal linking to help crawlers discover all your content
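Step 2 can be a single HTTP GET: IndexNow's documented endpoint accepts the page URL and your API key as query parameters (the key must also be served as a text file at your domain root so the endpoint can verify ownership). A sketch that builds the ping URL; example.com and the key are placeholders:

```python
from urllib.parse import urlencode

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def indexnow_ping_url(page_url: str, key: str) -> str:
    """Build the GET URL that notifies IndexNow of a new or updated page."""
    return f"{INDEXNOW_ENDPOINT}?{urlencode({'url': page_url, 'key': key})}"

# Fetching this URL (with curl, urllib.request, etc.) performs the actual submission.
ping = indexnow_ping_url("https://example.com/new-post", "your-indexnow-key")
```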

Monitoring AI Crawler Activity

Check your server access logs for AI crawler user-agents. A typical log entry looks like:

203.0.113.7 - - [01/Mar/2025:09:14:22 +0000] "GET /llms.txt HTTP/1.1" 200 1248 "-" "GPTBot/1.0"

(The IP above is a documentation placeholder. Cross-check the source IP of real hits against OpenAI's published ranges; some scrapers spoof AI-crawler user-agents.)
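To scan a log file for these hits programmatically, match the user-agent field (the last quoted field in the combined log format) against the crawler names covered above. A sketch, assuming combined-format logs:

```python
import re

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bingbot")

# Combined Log Format: IP, identity, user, [time], "request", status, bytes, "referrer", "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def ai_crawler_hits(log_lines):
    """Yield (crawler_name, request) for every line whose user-agent names a known AI bot."""
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ua = m.group("ua").lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in ua:
                yield bot, m.group("request")
                break

SAMPLE = '203.0.113.7 - - [01/Mar/2025:09:14:22 +0000] "GET /llms.txt HTTP/1.1" 200 1248 "-" "GPTBot/1.0"'
```

Feed it your access log line by line and aggregate by bot name to see which platforms are actually reaching you, and how often.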

If you never see AI crawlers in your logs, investigate:

  • Is your robots.txt blocking them?
  • Is your content behind JavaScript rendering?
  • Is your site returning slow responses?
  • Are you using Cloudflare or another WAF that blocks unfamiliar user-agents?

Summary: AI Crawler Checklist

  • [ ] Verify robots.txt does not block GPTBot, ClaudeBot, PerplexityBot, Google-Extended
  • [ ] Create llms.txt at your domain root
  • [ ] Ensure all public pages are server-side rendered (no JS-only content)
  • [ ] Submit sitemap.xml to Google Search Console
  • [ ] Enable IndexNow for real-time Bing/Copilot notification
  • [ ] Check server logs for AI crawler activity
  • [ ] Test with curl -A "GPTBot" https://yourdomain.com

Getting all of these right is the technical foundation of GEO. Without crawlability, no other optimisation will reach AI systems.

Check your GEO score for free

See how your website scores across all 8 GEO categories. Takes 60 seconds.

Get your free GEO score →