
How GPTBot, ClaudeBot & Perplexity Crawl Your Website

AI assistants send their own crawlers to index your content. Understanding how GPTBot, ClaudeBot, PerplexityBot, and Google-Extended work is the first step to ensuring they can find and cite your site.

By Kyle Fairburn, Founder & AI Specialist at NexRank

The Rise of AI Crawlers

In 2023–2024, every major AI platform launched its own web crawler. These bots don't rank pages for blue-link search results — they ingest content to power AI assistant responses, train future models, or both.

Understanding how each crawler works is essential for GEO. A site that accidentally blocks GPTBot is invisible to ChatGPT's browsing mode. A site that blocks all bots with a wildcard rule misses every AI platform simultaneously.

The Main AI Crawlers

GPTBot (OpenAI)

  • User-agent: GPTBot
  • IP ranges: Published at [openai.com/gptbot](https://openai.com/gptbot)
  • Purpose: Collects page content for model training (OpenAI documents separate agents, such as ChatGPT-User, for live browsing)
  • Crawl rate: Moderate — respects Crawl-delay directives
  • What it reads: HTML content, structured data, meta tags, Open Graph

GPTBot was introduced in August 2023 and gathers content for model training; OpenAI documents separate user-agents (ChatGPT-User, OAI-SearchBot) for user-triggered browsing and search. OpenAI publishes GPTBot's IP ranges and provides opt-out documentation.

ClaudeBot (Anthropic)

  • User-agent: ClaudeBot
  • Purpose: Content indexing for Claude's retrieval system
  • Crawl rate: Conservative — lower frequency than GPTBot
  • What it reads: Main content, headings, structured data

Anthropic's ClaudeBot follows robots.txt rules and respects crawl delays. It also honours noindex signals and is designed to avoid pages containing personal data.

PerplexityBot

  • User-agent: PerplexityBot
  • Purpose: Real-time web retrieval for Perplexity AI answers
  • Crawl rate: High frequency — Perplexity cites live sources
  • What it reads: Everything — Perplexity prioritises recency

PerplexityBot is particularly important because Perplexity cites its sources inline in answers. Being indexed by PerplexityBot translates directly into in-answer citations visible to users.

Google-Extended

  • User-agent: Google-Extended
  • Purpose: Controls whether content is used to train Gemini (formerly Bard) and to ground Google's AI products
  • Crawl rate: N/A (Google-Extended is a robots.txt control token; Googlebot does the actual fetching)
  • What it reads: All content that Googlebot indexes

Google-Extended is separate from Googlebot: it is not a crawler in its own right, but a robots.txt token that tells Google whether content Googlebot has already fetched may be used for Gemini training and grounding. You can block Google-Extended without affecting your traditional Google rankings, and per Google's documentation it does not control inclusion in AI Overviews, which follow normal Search indexing.

Bingbot (Microsoft Copilot)

  • User-agent: Bingbot
  • Purpose: Powers both Bing Search and Microsoft Copilot
  • Crawl rate: High
  • Note: The same Bingbot powers both traditional Bing results and AI-generated Copilot answers

How AI Crawlers Differ from Googlebot

Traditional search crawlers (Googlebot) prioritise:

  • Page rank signals (backlinks, authority)
  • Keyword relevance
  • Page speed and Core Web Vitals

AI crawlers prioritise:

  • Structured data — JSON-LD schemas that explicitly declare what a page is about
  • Content extractability — Clean HTML that can be parsed without JavaScript execution
  • Factual density — Pages with statistics, definitions, and citable claims
  • Freshness — Recently updated content for RAG-based systems like Perplexity
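The first of those priorities is easy to make concrete: a JSON-LD block in a page's head declares the page type and key facts in a form crawlers can parse without heuristics. A minimal sketch for an article page like this one (the dates are placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How GPTBot, ClaudeBot & Perplexity Crawl Your Website",
  "author": { "@type": "Person", "name": "Kyle Fairburn" },
  "datePublished": "2025-03-01",
  "dateModified": "2025-03-01"
}
```

Serve it in a script tag with type="application/ld+json"; the dateModified field also feeds the freshness signal mentioned above.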

robots.txt Configuration for AI Crawlers

The problem: wildcard blocks

Many sites use a wildcard disallow to block all bots except Googlebot:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

This configuration blocks every AI crawler. If your robots.txt looks like this, you are invisible to ChatGPT, Perplexity, Claude, and every other AI platform.

The recommended configuration

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Explicitly listing each AI bot ensures they have clear permission — even if your default rules are restrictive.

Pages to block from AI crawlers

Some content should not be indexed by AI systems:

  • Admin and authentication pages: Disallow: /admin/
  • User-generated private content: Disallow: /dashboard/
  • Duplicate or thin pages: Disallow: /tag/, Disallow: /author/
  • API endpoints: Disallow: /api/

llms.txt: The AI-Specific robots.txt

In 2024, a proposed standard emerged: llms.txt. Placed at yourdomain.com/llms.txt, it provides AI systems with a structured overview of your site — similar to what robots.txt does for crawl permissions, but for content discovery.

A well-structured llms.txt includes:

  • Company name and description
  • What the site covers
  • Key pages with descriptions
  • Content that AI assistants may cite
  • Contact and verification information
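The file is conventionally markdown: an H1 with the site name, a blockquote summary, then link lists with short descriptions. A minimal sketch (the company and URLs are hypothetical):

```markdown
# Example Co

> Example Co makes invoicing software for freelancers. This site covers product
> documentation, pricing, and a blog on small-business finance.

## Key pages

- [Product overview](https://example.com/product): What the software does and who it is for
- [Pricing](https://example.com/pricing): Current plans and billing FAQ
- [Docs](https://example.com/docs): Setup guides and API reference

## Citable content

- [Blog](https://example.com/blog): Original guides and statistics on freelancer finances

## Contact

- [About](https://example.com/about): Company details, press contact, and verification info
```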

Anthropic, Cloudflare, and a growing number of developer platforms publish llms.txt files. Crawler support is still uneven, but a site without one forgoes a cheap, direct channel for describing its content to AI systems.

JavaScript Rendering: A Common Crawl Blocker

Many AI crawlers — particularly older versions — cannot execute JavaScript. If your content is rendered client-side (React, Vue, Angular SPAs), AI crawlers may see empty pages.

Test this: Use curl -A "GPTBot" https://yourdomain.com to see what a raw HTTP request returns. If the response is minimal HTML with no content, AI crawlers cannot read your pages.

Fix: Implement Server-Side Rendering (SSR) or Static Site Generation (SSG). Next.js, Nuxt, and SvelteKit all support this. For existing SPAs, consider adding a server-rendered sitemap and key landing pages.
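A rough way to automate the curl check above: extract the visible text from the raw HTML and see whether there is any. The heuristic below (a sketch, not a crawler simulation) counts text outside script and style tags; a near-zero count on a content page usually means client-side rendering:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style> tags."""
    def __init__(self):
        super().__init__()
        self.in_skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_skip:
            self.in_skip -= 1

    def handle_data(self, data):
        if not self.in_skip and data.strip():
            self.chunks.append(data.strip())

def visible_text_length(html: str) -> int:
    """Length of the text a non-JavaScript crawler would see in this HTML."""
    parser = TextExtractor()
    parser.feed(html)
    return sum(len(chunk) for chunk in parser.chunks)

# A client-side-rendered shell: almost nothing outside the <script> tag
SPA_SHELL = '<html><body><div id="root"></div><script>/* 200KB bundle */</script></body></html>'
# A server-rendered page: the article text is in the HTML itself
SSR_PAGE = '<html><body><h1>How AI crawlers work</h1><p>GPTBot fetches raw HTML...</p></body></html>'
```

Run it against the HTML returned by a plain curl fetch of your key pages; a content page whose count is close to zero is effectively blank to a non-JavaScript crawler.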

Crawl Budget and Frequency

AI crawlers visit sites at different frequencies:

  • Perplexity: Most frequent — real-time retrieval means continuous crawling
  • GPTBot: Moderate — periodic updates to its knowledge base
  • ClaudeBot: Conservative — less frequent than Google or OpenAI
  • Google-Extended: Follows Googlebot's crawl schedule, since it is a usage token rather than a separate crawler

To maximise crawl coverage:

  1. Submit your sitemap.xml via Google Search Console (Google-Extended benefits automatically)
  2. Use IndexNow to notify Bing (and Bingbot/Copilot) of new content instantly
  3. Ensure fast page loads — crawlers abandon slow pages and return less frequently
  4. Use internal linking to help crawlers discover all your content
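Step 2 can be a single HTTP GET: IndexNow's documented endpoint accepts the page URL and your API key as query parameters (the key must also be served as a text file at your domain root so the endpoint can verify ownership). A sketch that builds the ping URL; example.com and the key are placeholders:

```python
from urllib.parse import urlencode

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def indexnow_ping_url(page_url: str, key: str) -> str:
    """Build the GET URL that notifies IndexNow of a new or updated page."""
    return f"{INDEXNOW_ENDPOINT}?{urlencode({'url': page_url, 'key': key})}"

# Fetching this URL (with curl, urllib.request, etc.) performs the actual submission.
ping = indexnow_ping_url("https://example.com/new-post", "your-indexnow-key")
```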

Monitoring AI Crawler Activity

Check your server access logs for AI crawler user-agents. A typical log entry looks like:

203.0.113.7 - - [01/Mar/2025:09:14:22 +0000] "GET /llms.txt HTTP/1.1" 200 1248 "-" "GPTBot/1.0"

(The IP above is a documentation placeholder. Cross-check the source IP of real hits against OpenAI's published ranges; some scrapers spoof AI-crawler user-agents.)
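To scan a log file for these hits programmatically, match the user-agent field (the last quoted field in the combined log format) against the crawler names covered above. A sketch, assuming combined-format logs:

```python
import re

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bingbot")

# Combined Log Format: IP, identity, user, [time], "request", status, bytes, "referrer", "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def ai_crawler_hits(log_lines):
    """Yield (crawler_name, request) for every line whose user-agent names a known AI bot."""
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ua = m.group("ua").lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in ua:
                yield bot, m.group("request")
                break

SAMPLE = '203.0.113.7 - - [01/Mar/2025:09:14:22 +0000] "GET /llms.txt HTTP/1.1" 200 1248 "-" "GPTBot/1.0"'
```

Feed it your access log line by line and aggregate by bot name to see which platforms are actually reaching you, and how often.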

If you never see AI crawlers in your logs, investigate:

  • Is your robots.txt blocking them?
  • Is your content behind JavaScript rendering?
  • Is your site returning slow responses?
  • Are you using Cloudflare or another WAF that blocks unfamiliar user-agents?

Summary: AI Crawler Checklist

  • [ ] Verify robots.txt does not block GPTBot, ClaudeBot, PerplexityBot, Google-Extended
  • [ ] Create llms.txt at your domain root
  • [ ] Ensure all public pages are server-side rendered (no JS-only content)
  • [ ] Submit sitemap.xml to Google Search Console
  • [ ] Enable IndexNow for real-time Bing/Copilot notification
  • [ ] Check server logs for AI crawler activity
  • [ ] Test with curl -A "GPTBot" https://yourdomain.com

Getting all of these right is the technical foundation of GEO. Without crawlability, no other optimisation will reach AI systems.

Check your GEO score for free

See how your website scores across all 8 GEO categories. Takes 60 seconds.

Get your free GEO score →