The worker inspects a URL’s robots.txt file and page-level directives to determine whether known AI crawlers (like GPTBot, ClaudeBot, PerplexityBot) are allowed or blocked. It provides per-crawler status and a summary by platform.
Disclaimer: This tool reports configuration only. It does not verify whether a URL is actually indexed or used by any AI product, as there are no public APIs for that information.
Features
- Check 14+ known AI crawler bots in one request
- Parse robots.txt with user-agent-specific rules
- Detect page-level blocks (meta robots, X-Robots-Tag headers)
- Support for generic directives (noindex, noai, noimageai)
- Stateless edge execution (no bindings required)
- Sub-second response times
API Reference
GET /check
Check AI crawler visibility configuration for a URL.

Query parameter: the target URL to check. Must use the http:// or https:// scheme.

Response
- The URL that was analyzed.
- A reminder that this reports configuration only, not actual indexing status.
- Array of crawler status objects.
- Aggregated results by status.
Example Request
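A minimal illustration (the worker hostname is a placeholder, and the `url` query parameter name is an assumption):

```
curl "https://<your-worker>.workers.dev/check?url=https://example.com/"
```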
Example Response
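A sketch of the response shape implied by the fields above (key names are assumptions, not confirmed by the source):

```json
{
  "url": "https://example.com/",
  "note": "Configuration only; does not reflect actual indexing.",
  "crawlers": [
    { "id": "GPTBot", "platform": "ChatGPT", "status": "blocked" },
    { "id": "ClaudeBot", "platform": "Claude", "status": "not_specified" }
  ],
  "summary": { "blocked": 1, "allowed": 0, "not_specified": 13 }
}
```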
Error Responses
- Human-readable error message.
- Machine-readable error code.
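An illustrative error body (the `error` and `code` key names are assumptions):

```json
{ "error": "A valid http:// or https:// URL is required", "code": "invalid_url" }
```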
Monitored AI Crawlers
The worker checks the following AI crawlers:

| Crawler ID | Platform | Description |
|---|---|---|
| GPTBot | ChatGPT | OpenAI’s web crawler |
| ChatGPT-User | ChatGPT | ChatGPT user-facing bot |
| OAI-SearchBot | ChatGPT Search | OpenAI’s search crawler |
| ClaudeBot | Claude | Anthropic’s web crawler |
| PerplexityBot | Perplexity | Perplexity AI’s crawler |
| Google-Extended | Google (Gemini/Bard) | Google’s Gemini crawler |
| Google-CloudVertexBot | Google Vertex AI | Google Vertex AI crawler |
| CCBot | Common Crawl | Common Crawl corpus builder |
| Bytespider | ByteDance | ByteDance’s crawler |
| Meta-ExternalAgent | Meta AI | Meta’s AI crawler |
| FacebookBot | Meta | Facebook’s web crawler |
| Applebot | Apple | Apple’s web crawler |
| Amazonbot | Amazon | Amazon’s web crawler |
| DuckAssistBot | DuckDuckGo | DuckDuckGo AI assistant |
This list is defined in src/constants/crawlers.ts and can be customized for your needs.

Implementation Details
Detection Logic
The worker uses a multi-layered approach to determine crawler visibility:

Parse robots.txt rules

Extract per-user-agent Allow/Disallow directives. The parser handles:
- User-agent specific rules
- Wildcard user-agents (*)
- Path-specific allow/disallow patterns
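The parsing step above can be sketched as a small function. This is an illustrative sketch only, not the repository’s actual parser; it groups Allow/Disallow paths under each User-agent token, treating consecutive User-agent lines as sharing one rule group:

```typescript
interface AgentRules {
  allow: string[];
  disallow: string[];
}

// Parse robots.txt text into per-user-agent Allow/Disallow path lists.
function parseRobotsTxt(text: string): Map<string, AgentRules> {
  const rules = new Map<string, AgentRules>();
  let currentAgents: string[] = [];
  let lastWasAgent = false;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    if (!line) continue;
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share the rule group that follows.
      if (!lastWasAgent) currentAgents = [];
      const agent = value.toLowerCase();
      currentAgents.push(agent);
      if (!rules.has(agent)) rules.set(agent, { allow: [], disallow: [] });
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      for (const agent of currentAgents) {
        rules.get(agent)![field].push(value);
      }
      lastWasAgent = false;
    }
  }
  return rules;
}
```

A crawler’s rules are then looked up by its lowercased user-agent token, falling back to the `*` group when no specific entry exists.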
Core Implementation
The main route handler lives in src/routes/check.ts.
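A hedged sketch of what such a handler might look like; the actual implementation may differ, and the `url` query parameter name and response keys are assumptions:

```typescript
const worker = {
  async fetch(request: Request): Promise<Response> {
    const { pathname, searchParams } = new URL(request.url);
    if (pathname !== "/check") {
      return json({ error: "Not found", code: "not_found" }, 404);
    }
    const target = searchParams.get("url");
    if (!target || !/^https?:\/\//i.test(target)) {
      return json(
        { error: "A valid http:// or https:// URL is required", code: "invalid_url" },
        400,
      );
    }
    // Fetch robots.txt and the page itself in parallel.
    const origin = new URL(target).origin;
    const [robotsRes, pageRes] = await Promise.all([
      fetch(`${origin}/robots.txt`),
      fetch(target),
    ]);
    const robotsTxt = robotsRes.ok ? await robotsRes.text() : "";
    const xRobotsTag = pageRes.headers.get("X-Robots-Tag") ?? "";
    // ...parse robotsTxt, scan the HTML for meta robots tags, and
    // evaluate each known crawler against the collected rules here...
    return json({ url: target, robotsTxtFound: robotsRes.ok, xRobotsTag }, 200);
  },
};

// Small helper: serialize a JSON response with a status code.
function json(body: unknown, status: number): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

export default worker;
```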
Status Precedence
The visibility logic in src/lib/visibility.ts resolves each crawler’s status by a fixed precedence of rules.
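The exact ordering lives in src/lib/visibility.ts; a plausible sketch, assuming page-level directives win over robots.txt rules and explicit rules win over the absence of any rule:

```typescript
type Status = "blocked" | "allowed" | "not_specified";

// Illustrative precedence (an assumption, not the confirmed order):
// 1. page-level block (meta robots / X-Robots-Tag) -> blocked
// 2. explicit robots.txt Disallow                  -> blocked
// 3. explicit robots.txt Allow                     -> allowed
// 4. no matching rule anywhere                     -> not_specified
function resolveStatus(
  pageBlocked: boolean,
  robotsRule: "allow" | "disallow" | null,
): Status {
  if (pageBlocked) return "blocked";
  if (robotsRule === "disallow") return "blocked";
  if (robotsRule === "allow") return "allowed";
  return "not_specified";
}
```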
Page-Level Directives
The worker detects page-level blocks in meta robots tags and X-Robots-Tag response headers.

robots.txt Parsing
Example robots.txt rules:
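An illustrative rule set (not taken from any real site) blocking one crawler, allowing another, and restricting everyone else to part of the site:

```txt
# Block GPTBot entirely
User-agent: GPTBot
Disallow: /

# Explicitly allow ClaudeBot
User-agent: ClaudeBot
Allow: /

# All other crawlers: keep /private/ off-limits
User-agent: *
Disallow: /private/
```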
Setup & Deployment
Prerequisites
- Node.js v18 or later
- A Cloudflare account
- Wrangler CLI installed
Local Development
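A typical Wrangler workflow, assuming a standard Workers project layout:

```shell
npm install        # install dependencies
npx wrangler dev   # run the worker locally (defaults to http://localhost:8787)
```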
Deploy to Production
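Deployment uses the standard Wrangler command (requires a logged-in Cloudflare account):

```shell
npx wrangler deploy   # publish the worker to your Cloudflare account
```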
One-Click Deploy
Configuration
No bindings or environment variables are required. The wrangler.toml is minimal:
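A minimal illustrative configuration; the worker name, entry point, and date below are placeholders, not the repository’s actual values:

```toml
name = "ai-crawler-check"
main = "src/index.ts"
compatibility_date = "2024-01-01"
```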
Dependencies
Cloudflare Features Used
- Workers - Serverless execution environment
- Fetch API - HTTP client for fetching pages and robots.txt
- Edge Networking - Low-latency requests from global edge locations
Use Cases
- SEO Tools - Add AI crawler visibility checks to SEO audit dashboards
- CMS Plugins - Integrate into WordPress/Drupal to show AI bot status
- Privacy Compliance - Monitor which AI platforms can access your content
- Analytics Dashboards - Track AI crawler access policies across sites
- Browser Extensions - Show AI visibility status for the current page
- Web Scraping Tools - Check if your scraper is allowed before crawling
Limitations
- Configuration only: Does not verify actual indexing by AI platforms (no public APIs exist)
- Static analysis: Does not execute JavaScript; only analyzes HTML and headers
- No authentication: Cannot check auth-protected pages
- robots.txt compliance: Assumes crawlers respect robots.txt (not legally enforced)
- Limited crawlers: Only checks 14 known AI crawlers (list can be extended)
- No sitemap parsing: Only checks individual URLs, not entire sitemaps
Advanced Usage
Adding Custom Crawlers
Extend the crawler list in src/constants/crawlers.ts:
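A hedged sketch of the registry shape, inferred from the docs table above; the real src/constants/crawlers.ts may use different field names, and `MyCustomBot` is a hypothetical entry:

```typescript
interface Crawler {
  id: string;          // user-agent token matched against robots.txt
  platform: string;    // product the crawler feeds
  description: string;
}

const CRAWLERS: Crawler[] = [
  { id: "GPTBot", platform: "ChatGPT", description: "OpenAI's web crawler" },
  // ...the other built-in entries...
  // Append your own crawler (hypothetical example):
  { id: "MyCustomBot", platform: "My Platform", description: "Custom AI crawler" },
];
```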
Batch Checking Multiple URLs
Create a new endpoint that accepts several URLs and runs the same check for each.

Adding Site-Wide Analysis
Fetch and parse the site’s sitemap to check all pages.

Related Resources
robots.txt Specification
Learn about robots.txt syntax and best practices
Meta Robots Tag
Documentation for meta robots directives
AI Crawl Control
Cloudflare’s AI crawler blocking feature
GitHub Repository
View the complete source code
FAQ
Why does it say 'not_specified' for most crawlers?
If a site doesn’t explicitly allow or block a crawler in robots.txt or meta tags, the status is not_specified. This means the crawler could access the page by default, but there’s no explicit rule.

Does 'allowed' mean the page is definitely in ChatGPT/Claude?
No. allowed means the page configuration doesn’t block the crawler. Actual indexing depends on many factors: crawler behavior, content quality, update frequency, and internal platform decisions.

Can I check private/authenticated pages?
Not with the default implementation. You’d need to add authentication headers to the fetch requests and handle cookies/sessions.
How accurate is the robots.txt parsing?
The parser handles standard robots.txt directives (User-agent, Allow, Disallow) with path prefix matching. It doesn’t support regex patterns or advanced extensions.
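The path-prefix matching described here amounts to a one-line check (illustrative; the repository’s matcher may differ):

```typescript
// A robots.txt rule applies when the URL path starts with the rule's
// path; no regex or wildcard expansion is attempted.
function ruleMatches(rulePath: string, urlPath: string): boolean {
  return rulePath.length > 0 && urlPath.startsWith(rulePath);
}
```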
What if a crawler ignores robots.txt?
This tool reports configuration only. If a crawler doesn’t respect robots.txt (which is voluntary), you’d need to block it at the network level (firewall, CDN rules, etc.).