Internet Archive
Tier 2
Preserves web pages for the Wayback Machine. Non-profit digital preservation.
ia_archiver
What is Internet Archive?
ia_archiver is a web archiving crawler operated by Internet Archive. It collects web pages for preservation in the Wayback Machine.
Archived pages are publicly accessible for historical research and reference.
If you block ia_archiver, your site won't be preserved in the Wayback Machine for historical reference.
Internet Archive is a non-profit organization, and many consider the historical preservation it provides culturally important.
About Internet Archive
Internet Archive operates 2 known bots. Billions of archived pages are accessed through its service each month.
ia_archiver robots.txt Configuration
Control ia_archiver access to your website using robots.txt directives.
Block ia_archiver
To completely block Internet Archive from crawling your site:
User-agent: ia_archiver
Disallow: /
Allow ia_archiver Full Access
To explicitly allow Internet Archive to crawl your entire site:
User-agent: ia_archiver
Allow: /
Selective Access for ia_archiver
To allow Internet Archive but restrict certain directories:
User-agent: ia_archiver
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
Internet Archive respects robots.txt directives.
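Before deploying any of the rule sets above, it can help to check them offline. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `can_fetch`, the constants, and the `example.com` URLs are illustrative assumptions, not part of any Internet Archive tooling. Note that Python's parser applies the first matching rule, while some crawlers use longest-match semantics; for these particular rule sets both approaches should agree.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Parse robots.txt text in memory and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The three configurations shown above, as plain strings.
BLOCK = """\
User-agent: ia_archiver
Disallow: /
"""

ALLOW = """\
User-agent: ia_archiver
Allow: /
"""

SELECTIVE = """\
User-agent: ia_archiver
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
"""

# Hypothetical site used only for illustration.
SITE = "https://example.com"

if __name__ == "__main__":
    checks = [
        ("block", BLOCK, "/about"),
        ("allow", ALLOW, "/about"),
        ("selective", SELECTIVE, "/private/notes.html"),
        ("selective", SELECTIVE, "/blog/post"),
    ]
    for name, rules, path in checks:
        verdict = "allowed" if can_fetch(rules, SITE + path) else "blocked"
        print(f"{name:10s} {path:22s} {verdict}")
```

Because each rule set names only `ia_archiver`, other user agents are unaffected by it; passing a different `agent` value to `can_fetch` is one way to confirm that.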
ia_archiver User-Agent String
The user-agent token for Internet Archive is:
ia_archiver
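To spot this crawler in server logs or request handlers, one option is to look for the token in the `User-Agent` header. This is a rough sketch: the function name `is_ia_archiver` is hypothetical, the full header the crawler sends is not documented here, and headers can be spoofed, so a match should be treated as a hint rather than proof.

```python
from typing import Optional

# Token taken from the text above.
IA_TOKEN = "ia_archiver"


def is_ia_archiver(user_agent_header: Optional[str]) -> bool:
    """Return True if the header contains the ia_archiver token (case-insensitive)."""
    return IA_TOKEN in (user_agent_header or "").lower()


if __name__ == "__main__":
    # Sample header values are made up for illustration.
    for header in ["ia_archiver", "Mozilla/5.0 (compatible; SomeOtherBot/1.0)", None]:
        print(repr(header), is_ia_archiver(header))
```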