🦊 StackFox

Internet Archive

Tier 2
📚 AI Training · by Internet Archive · Since 1996

Preserves web pages for the Wayback Machine. Non-profit digital preservation.

User-Agent Token
ia_archiver
Respects robots.txt
Yes
Impact Level
Major
100M+ users - Meta, Apple, Microsoft, Perplexity, xAI
Estimated Reach
Billions of archived pages accessed monthly

🎯 What is Internet Archive?

ia_archiver is a crawler operated by Internet Archive. This directory categorizes it as an AI training crawler, although its stated purpose is preserving web pages for the Wayback Machine.

📊 How Your Data is Used

Archived pages are publicly accessible for historical research and reference.

🚫 What Happens If You Block

Your site won't be preserved in the Wayback Machine for historical reference.

💡 Good to Know

Internet Archive is a non-profit organization. Blocking its crawler prevents historical preservation of your site, which many consider culturally important.

๐ŸขAbout Internet Archive

Internet Archive

Internet Archive operates 2 known bots, listed here under AI model training. Its estimated reach is billions of archived pages accessed monthly.

๐Ÿ›ก๏ธia_archiver robots.txt Configuration

Control ia_archiver access to your website using robots.txt directives.

Block ia_archiver

To completely block Internet Archive from crawling your site:

User-agent: ia_archiver
Disallow: /
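Before deploying a rule like this, it can help to confirm that it does what you expect. The sketch below is one way to do that, using Python's standard-library `urllib.robotparser` to evaluate the rules in memory. The helper name `is_allowed`, the `example.com` URL, and the agent name `SomeOtherBot` are placeholders introduced for illustration, not part of any Internet Archive tooling.

```python
from urllib.robotparser import RobotFileParser

# The "Block ia_archiver" rules, as they would appear in robots.txt.
BLOCK_RULES = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `rules` permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse in memory, no network request
    return parser.can_fetch(user_agent, url)


# example.com and "SomeOtherBot" are placeholders, not real targets.
page = "https://example.com/articles/index.html"
print(is_allowed(BLOCK_RULES, "ia_archiver", page))   # ia_archiver is blocked
print(is_allowed(BLOCK_RULES, "SomeOtherBot", page))  # other agents are unaffected
```

Because the group names only `ia_archiver`, agents without a matching group (and with no `*` group present) remain allowed.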

Allow ia_archiver Full Access

To explicitly allow Internet Archive to crawl your entire site:

User-agent: ia_archiver
Allow: /

Selective Access for ia_archiver

To allow Internet Archive but restrict certain directories:

User-agent: ia_archiver
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
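To see how the selective rules apply to individual paths, the following sketch evaluates them with Python's standard-library `urllib.robotparser`. The paths and the `example.com` host are hypothetical, chosen only for illustration. Note that Python's parser applies rules in file order (first match wins), whereas some crawlers use longest-match precedence; for this rule set both approaches should give the same answers.

```python
from urllib.robotparser import RobotFileParser

# The "Selective Access" rules, as they would appear in robots.txt.
SELECTIVE_RULES = """\
User-agent: ia_archiver
Disallow: /private/
Disallow: /api/
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SELECTIVE_RULES.splitlines())  # in-memory parse, no network request

# Hypothetical paths chosen for illustration.
paths = ["/", "/blog/post-1", "/private/notes.txt", "/api/v1/users", "/admin/"]
results = {
    path: parser.can_fetch("ia_archiver", "https://example.com" + path)
    for path in paths
}

for path, allowed in results.items():
    print(f"{path:<22} {'allowed' if allowed else 'blocked'}")
```

Only the three restricted directories are reported as blocked; everything else falls through to `Allow: /`.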

✓ Internet Archive respects robots.txt directives.

ia_archiver User-Agent String

The user-agent token for Internet Archive is:

ia_archiver
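If you want to spot this crawler in your own server logs, one simple approach is a case-insensitive substring check for the token in the `User-Agent` header. The sketch below assumes you have already extracted the header values; the function name `matches_ia_archiver` and the sample strings are made up for illustration and are not actual headers sent by any crawler.

```python
TOKEN = "ia_archiver"


def matches_ia_archiver(user_agent_header: str) -> bool:
    """Return True if the header contains the ia_archiver token (case-insensitive)."""
    return TOKEN in user_agent_header.lower()


# Made-up header values, for illustration only.
samples = [
    "ia_archiver",
    "Mozilla/5.0 (compatible; IA_Archiver)",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]
flags = [matches_ia_archiver(sample) for sample in samples]

for sample, flag in zip(samples, flags):
    print(f"{'match' if flag else 'no match':<9} {sample}")
```

A substring check only tells you what a client claims to be, since any client can send any `User-Agent` value.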

🔗 Other Internet Archive Bots
