🦊 StackFox

CCBot

Tier 2
📚 AI Training · by Common Crawl Foundation · Since 2008

Open web archive used by many AI companies for training data.

User-Agent Token: CCBot
Respects robots.txt: Yes
Impact Level: Major (100M+ users - Meta, Apple, Microsoft, Perplexity, xAI)
Estimated Reach: Powers training for OpenAI, Anthropic, Google, Meta, Amazon, Nvidia

🎯 What is CCBot?

CCBot is the web crawler operated by the Common Crawl Foundation. It collects web pages for an open archive that many AI companies use as training data.

📊 How Your Data is Used

Common Crawl Foundation is a non-profit that builds a public web archive. OpenAI and Anthropic each donated $250k to it in 2023. Google's C4 dataset is a filtered version of Common Crawl.

🚫 What Happens If You Block

Your content won't be added to the Common Crawl archive. This indirectly affects the training of models such as GPT, Claude, Gemini, and Llama.

💡 Good to Know

Critical: an estimated 80% of LLM training data comes from Common Crawl, and the archive is cited in 10,000+ academic papers. Blocking CCBot indirectly affects all major AI models.

๐ŸขAbout Common Crawl Foundation

Common Crawl Foundation operates one known bot, CCBot, for AI model training. Its archive powers training for OpenAI, Anthropic, Google, Meta, Amazon, and Nvidia.

๐Ÿ›ก๏ธCCBot robots.txt Configuration

Control CCBot access to your website using robots.txt directives.

Block CCBot

To completely block CCBot from crawling your site:

User-agent: CCBot
Disallow: /
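
Before deploying a rule like this, it can help to confirm that it does what you expect. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed`, the constant `BLOCK_RULES`, the second bot name, and the `example.com` URL are illustrative assumptions, not part of any CCBot tooling.

```python
from urllib.robotparser import RobotFileParser

# The block rule from this section, as it would appear in robots.txt.
BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain; SomeOtherBot is a made-up crawler name.
print(is_allowed(BLOCK_RULES, "CCBot", "https://example.com/any/page"))         # False
print(is_allowed(BLOCK_RULES, "SomeOtherBot", "https://example.com/any/page"))  # True
```

The second call shows that the rule is scoped to CCBot only: crawlers with no matching group are unaffected.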

Allow CCBot Full Access

To explicitly allow CCBot to crawl your entire site:

User-agent: CCBot
Allow: /

Selective Access for CCBot

To allow CCBot but restrict certain directories:

User-agent: CCBot
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
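
To see how these selective rules apply to individual paths, here is a rough sketch that evaluates them with Python's standard `urllib.robotparser`. The function name `check_paths`, the constant `SELECTIVE_RULES`, and the sample paths are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The selective rules from this section, as they would appear in robots.txt.
SELECTIVE_RULES = """\
User-agent: CCBot
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
"""


def check_paths(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to True (crawlable) or False (blocked) for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


# Sample paths are made up for the demonstration.
sample_paths = ["/", "/blog/post-1", "/private/report.pdf", "/api/v1/users", "/admin/", "/admin"]
for path, allowed in check_paths(SELECTIVE_RULES, "CCBot", sample_paths).items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Two points are worth noting. Rules match by path prefix, so `/admin` without the trailing slash is not covered by `Disallow: /admin/`. Also, `urllib.robotparser` applies the first matching rule in file order, while RFC 9309 specifies the longest matching path; with the ordering shown here both approaches give the same answers.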

✓ CCBot respects robots.txt directives.

CCBot User-Agent String

The user-agent token for CCBot is:

CCBot
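
If you want to spot CCBot requests in server logs or application code, one option is to look for this token in the User-Agent header. The sketch below is an assumption-laden example: the function name `is_ccbot` and the sample header values are illustrative, and the exact full string CCBot sends may differ from the one shown.

```python
import re
from typing import Optional

# Match the token "CCBot" as a whole word, ignoring case.
CCBOT_PATTERN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def is_ccbot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header value contains the CCBot token."""
    return bool(CCBOT_PATTERN.search(user_agent or ""))


# Illustrative header values, not captured from real traffic.
print(is_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))   # True
print(is_ccbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
print(is_ccbot(None))                                         # False
```

Because any client can send any User-Agent value, a match is a hint rather than proof that a request comes from Common Crawl.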
