CCBot
Tier 2
Open web archive used by many AI companies for training data.
What is CCBot?
CCBot is a crawler operated by the Common Crawl Foundation. It collects web pages for an open archive that many AI companies use to train models.
Common Crawl is a non-profit that builds a public web archive. OpenAI and Anthropic each donated $250k in 2023. Google's C4 dataset is a filtered version of Common Crawl.
If you block CCBot, your content won't appear in the Common Crawl archive. This indirectly affects training of GPT, Claude, Gemini, and Llama.
CRITICAL: 80% of LLM training data comes from Common Crawl, and the archive is cited in 10,000+ academic papers. Blocking CCBot indirectly affects all major AI models.
About Common Crawl Foundation
Common Crawl Foundation operates 1 known bot, CCBot, which gathers data used for AI model training. Its archive powers training for OpenAI, Anthropic, Google, Meta, Amazon, and Nvidia.
CCBot robots.txt Configuration
Control CCBot access to your website using robots.txt directives.
Block CCBot
To completely block CCBot from crawling your site:
User-agent: CCBot
Disallow: /
Allow CCBot Full Access
To explicitly allow CCBot to crawl your entire site:
User-agent: CCBot
Allow: /
Selective Access for CCBot
To allow CCBot but restrict certain directories:
User-agent: CCBot
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
CCBot respects robots.txt directives.
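Before deploying any of these rule sets, it can help to check locally how a parser reads them. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper name ccbot_can_fetch and the example.com URLs are illustrative assumptions, not part of any Common Crawl tooling, and CCBot's own parser may differ in edge cases from Python's.

```python
from urllib.robotparser import RobotFileParser

# The "selective access" rules from this section, as robots.txt text.
SELECTIVE_RULES = """\
User-agent: CCBot
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
"""

# The "block" rules from this section.
BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""


def ccbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch the URL.

    Hypothetical helper for local testing; it evaluates the rules with
    Python's standard-library parser, not with CCBot itself.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not taken from the source.
    for path in ("/blog/post", "/private/report", "/api/v1/items"):
        url = "https://example.com" + path
        print(
            path,
            "selective:", ccbot_can_fetch(SELECTIVE_RULES, url),
            "block:", ccbot_can_fetch(BLOCK_RULES, url),
        )
```

Under the selective rules, only paths outside /private/, /api/, and /admin/ should be reported as fetchable. Under the block rules, no path should be.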
CCBot User-Agent String
The user-agent token for CCBot is:
CCBot
Who Blocks CCBot?