# AI Bot Visibility Checker (/docs/experiments/ai-bot-visibility)

This is an experimental Worker. Use it as a starting point for your own projects.

The AI Bot Visibility Checker analyzes a webpage's `robots.txt` file and page-level directives to determine whether known AI crawlers (like GPTBot, ClaudeBot, PerplexityBot) are allowed or blocked. It provides per-crawler status and a summary by platform.

**Disclaimer**: This tool reports **configuration only**. It does not verify whether a URL is actually indexed or used by any AI product, as there are no public APIs for that information.

## Features [#features]

* Check 14+ known AI crawler bots in one request
* Parse `robots.txt` with user-agent specific rules
* Detect page-level blocks (meta robots, X-Robots-Tag headers)
* Support for generic directives (`noindex`, `noai`, `noimageai`)
* Stateless edge execution (no bindings required)
* Sub-second response times

## API Reference [#api-reference]

### GET /check [#get-check]

Check AI crawler visibility configuration for a URL.

#### Response [#response]

* **`url`** `string` - The URL that was analyzed.
* **`disclaimer`** `string` - Reminder that this reports configuration only, not actual indexing status.
* **`crawlers`** `array` - Array of crawler status objects.
  * **`id`** `string` - The crawler's User-Agent token (e.g., "GPTBot", "ClaudeBot").
  * **`platform`** `string` - Human-readable platform name (e.g., "ChatGPT", "Claude").
  * **`status`** `string` - One of:
    * `"allowed"` - robots.txt and page directives allow this crawler
    * `"blocked"` - robots.txt or page-level directive blocks this crawler
    * `"not_specified"` - no rule applies for this crawler
* **`summary`** `object` - Aggregated results by status.
  * **`allowed`** `string[]` - Array of platform names that are allowed.
  * **`blocked`** `string[]` - Array of platform names that are blocked.
  * **`notSpecified`** `string[]` - Array of platform names with no specific rules.
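
For typed clients, the response fields described above can be written down as TypeScript types. This is a minimal sketch: the field names and status values come from the reference above, but the type names (`CrawlerResult`, `VisibilityResponse`) and the `blockedCrawlerIds` helper are hypothetical and may not match the definitions in the Worker's source.

```typescript
// Status values as listed in the response reference.
type CrawlerStatus = "allowed" | "blocked" | "not_specified";

// One entry of the `crawlers` array (type name is an assumption).
interface CrawlerResult {
  id: string; // User-Agent token, e.g. "GPTBot"
  platform: string; // Human-readable platform, e.g. "ChatGPT"
  status: CrawlerStatus;
}

// Full success payload of GET /check (type name is an assumption).
interface VisibilityResponse {
  url: string;
  disclaimer: string;
  crawlers: CrawlerResult[];
  summary: {
    allowed: string[];
    blocked: string[];
    notSpecified: string[];
  };
}

// Hypothetical helper: collect the IDs of all crawlers reported as blocked.
function blockedCrawlerIds(res: VisibilityResponse): string[] {
  return res.crawlers
    .filter((crawler) => crawler.status === "blocked")
    .map((crawler) => crawler.id);
}
```

A client could parse the JSON body into `VisibilityResponse` and call `blockedCrawlerIds` to drive alerts or dashboard badges.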
#### Example Request [#example-request]

```bash
curl "https://your-worker.workers.dev/check?url=https://www.cloudflare.com"
```

#### Example Response [#example-response]

```json
{
  "url": "https://www.cloudflare.com",
  "disclaimer": "Configuration only; we cannot verify actual index inclusion in any AI product.",
  "crawlers": [
    { "id": "GPTBot", "platform": "ChatGPT", "status": "allowed" },
    { "id": "ChatGPT-User", "platform": "ChatGPT", "status": "allowed" },
    { "id": "ClaudeBot", "platform": "Claude", "status": "not_specified" },
    { "id": "PerplexityBot", "platform": "Perplexity", "status": "blocked" },
    { "id": "Google-Extended", "platform": "Google (Gemini/Bard)", "status": "not_specified" }
    // ... 9 more crawlers
  ],
  "summary": {
    "allowed": ["ChatGPT"],
    "blocked": ["Perplexity"],
    "notSpecified": [
      "Claude",
      "ChatGPT Search",
      "Google (Gemini/Bard)",
      "Google Vertex AI",
      "Common Crawl",
      "ByteDance",
      "Meta AI",
      "Meta",
      "Apple",
      "Amazon",
      "DuckDuckGo"
    ]
  }
}
```

#### Error Responses [#error-responses]

* **`error`** `string` - Human-readable error message.
* **`code`** `string` - Machine-readable error code.
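
A caller can branch on the `code` field rather than on the message text. The sketch below assumes the two codes this document lists (`INVALID_URL` and `FETCH_ERROR`). The names `CheckError`, `isCheckError`, and `describeError` are hypothetical client-side helpers, not part of the Worker.

```typescript
// Error payload shape, based on the `error` and `code` fields described above.
interface CheckError {
  error: string;
  code: string;
}

// Type guard (hypothetical): true when a parsed JSON body looks like an error payload.
function isCheckError(body: unknown): body is CheckError {
  if (typeof body !== "object" || body === null) return false;
  const record = body as Record<string, unknown>;
  return typeof record.error === "string" && typeof record.code === "string";
}

// Map an error payload to a short hint for the caller.
function describeError(err: CheckError): string {
  switch (err.code) {
    case "INVALID_URL":
      return "Check the `url` query parameter.";
    case "FETCH_ERROR":
      return "The target URL could not be fetched.";
    default:
      // Unknown code: fall back to the server's message.
      return err.error;
  }
}
```

After `await response.json()`, a client would call `isCheckError(body)` first and only treat the body as a visibility result when the guard returns `false`.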
**400 Bad Request** - Missing or invalid URL:

```json
{
  "error": "Missing or invalid query parameter: url",
  "code": "INVALID_URL"
}
```

**502 Bad Gateway** - Failed to fetch the URL:

```json
{
  "error": "Failed to fetch URL",
  "code": "FETCH_ERROR"
}
```

## Monitored AI Crawlers [#monitored-ai-crawlers]

The worker checks the following AI crawlers:

| Crawler ID              | Platform             | Description                 |
| ----------------------- | -------------------- | --------------------------- |
| `GPTBot`                | ChatGPT              | OpenAI's web crawler        |
| `ChatGPT-User`          | ChatGPT              | ChatGPT user-facing bot     |
| `OAI-SearchBot`         | ChatGPT Search       | OpenAI's search crawler     |
| `ClaudeBot`             | Claude               | Anthropic's web crawler     |
| `PerplexityBot`         | Perplexity           | Perplexity AI's crawler     |
| `Google-Extended`       | Google (Gemini/Bard) | Google's Gemini crawler     |
| `Google-CloudVertexBot` | Google Vertex AI     | Google Vertex AI crawler    |
| `CCBot`                 | Common Crawl         | Common Crawl corpus builder |
| `Bytespider`            | ByteDance            | ByteDance's crawler         |
| `Meta-ExternalAgent`    | Meta AI              | Meta's AI crawler           |
| `FacebookBot`           | Meta                 | Facebook's web crawler      |
| `Applebot`              | Apple                | Apple's web crawler         |
| `Amazonbot`             | Amazon               | Amazon's web crawler        |
| `DuckAssistBot`         | DuckDuckGo           | DuckDuckGo AI assistant     |

This list is defined in `src/constants/crawlers.ts` and can be customized for your needs.

## Implementation Details [#implementation-details]

### Detection Logic [#detection-logic]

The worker uses a multi-layered approach to determine crawler visibility:

### Fetch robots.txt and page content [#fetch-robotstxt-and-page-content]

Both requests are made in parallel:

```typescript
const [pageResult, robotsBody] = await Promise.all([
  fetchPage(url),
  fetchRobotsTxt(origin),
]);
```

### Parse robots.txt rules [#parse-robotstxt-rules]

Extract per-user-agent Allow/Disallow directives:

```typescript
const rules = robotsBody ? parseRobotsTxt(robotsBody) : new Map();
```

The parser handles:

* User-agent specific rules
* Wildcard user-agents (`*`)
* Path-specific allow/disallow patterns

### Extract page-level signals [#extract-page-level-signals]

Check HTML meta tags and HTTP headers:

```typescript
// Detects robots meta tags in the HTML and response headers such as:
// X-Robots-Tag: noindex, nofollow
const signals = getPageRobotSignals(html, headers);
```

### Compute per-crawler status [#compute-per-crawler-status]

Apply precedence rules for each crawler:

```typescript
// 1. Page-level block -> blocked
// 2. robots.txt disallow -> blocked
// 3. robots.txt allow -> allowed
// 4. No rule -> not_specified
const status = computeStatus(rules, signals, crawlerId, path);
```

### Core Implementation [#core-implementation]

Here's the main route handler from `src/routes/check.ts`:

```typescript
app.get("/check", async (c) => {
  // Reject missing or malformed URLs before doing any network work
  const url = validateUrl(c.req.query("url"));
  if (!url) return jsonError(c, "Missing or invalid query parameter: url", "INVALID_URL");

  try {
    const parsed = new URL(url);
    const origin = parsed.origin;
    const path = parsed.pathname || "/";

    // Fetch the page and its robots.txt in parallel
    const [pageResult, robotsBody] = await Promise.all([fetchPage(url), fetchRobotsTxt(origin)]);
    const rules = robotsBody ? parseRobotsTxt(robotsBody) : new Map();

    const response = buildVisibilityResponse(url, path, rules, pageResult.html, pageResult.headers);
    return jsonSuccess(c, response);
  } catch (e) {
    const message = e instanceof Error ? e.message : "Failed to fetch URL";
    return jsonError(c, message, "FETCH_ERROR", 502);
  }
});
```

### Status Precedence [#status-precedence]

The visibility logic follows this precedence (from `src/lib/visibility.ts`):

```typescript
function computeStatus(
  rules: RobotsRules,
  signals: PageRobotSignals,
  crawlerId: string,
  path: string
): CrawlerStatus {
  // Page-level block takes highest precedence
  if (isCrawlerBlockedByPage(signals, crawlerId)) return "blocked";

  // Then check robots.txt
  const robotsAllow = isAllowedByRobots(rules, crawlerId, path);
  if (robotsAllow === false) return "blocked";
  if (robotsAllow === true) return "allowed";

  // No rule specified
  return "not_specified";
}
```

### Page-Level Directives [#page-level-directives]

The worker detects robots meta tags and `X-Robots-Tag` response headers, for example:

```html
X-Robots-Tag: noindex, noai
```

### robots.txt Parsing [#robotstxt-parsing]

Example `robots.txt` rules:

```txt
# Block specific crawlers
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /private/

# Allow all others
User-agent: *
Allow: /
```

The parser extracts per-user-agent rules and applies path matching with longest-prefix-wins logic.

## Advanced Usage [#advanced-usage]

### Adding Custom Crawlers [#adding-custom-crawlers]

Extend the crawler list in `src/constants/crawlers.ts`:

```typescript
export const AI_CRAWLERS: Array<{ id: string; platform: string }> = [
  { id: "GPTBot", platform: "ChatGPT" },
  { id: "ClaudeBot", platform: "Claude" },
  // Add your custom crawlers
  { id: "MyCustomBot", platform: "My AI Platform" },
  // ...
];
```

### Batch Checking Multiple URLs [#batch-checking-multiple-urls]

Create a new endpoint to check multiple URLs:

```typescript
app.post("/batch-check", async (c) => {
  const { urls } = await c.req.json();
  // checkVisibility is not defined in this document; it is assumed to wrap
  // the fetch, parse, and build steps of the /check handler for a single URL
  const results = await Promise.all(urls.map((url) => checkVisibility(url)));
  return c.json({ results });
});
```

### Adding Site-Wide Analysis [#adding-site-wide-analysis]

Fetch and parse the sitemap to check all pages:

```typescript
import { parseSitemap } from "./lib/sitemap";

app.get("/site-check", async (c) => {
  const sitemapUrl = c.req.query("sitemap");
  const urls = await parseSitemap(sitemapUrl);
  // Check each URL...
});
```

## FAQ [#faq]

If a site doesn't explicitly allow or block a crawler in robots.txt or meta tags, the status is `not_specified`. This means the crawler could access the page by default, but there's no explicit rule.

An `allowed` status does not mean the page is indexed by that platform. It only means the page configuration doesn't block the crawler. Actual indexing depends on many factors: crawler behavior, content quality, update frequency, and internal platform decisions.

Auth-protected pages can't be checked with the default implementation. You'd need to add authentication headers to the fetch requests and handle cookies/sessions.

The parser handles standard robots.txt directives (User-agent, Allow, Disallow) with path prefix matching. It doesn't support regex patterns or advanced extensions.

This tool reports configuration only. If a crawler doesn't respect robots.txt (compliance is voluntary), you'd need to block it at the network level (firewall, CDN rules, etc.).
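
The path prefix matching with longest-prefix-wins logic, and the three-way result that `computeStatus` expects (`true`, `false`, or no rule), can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/lib`. The `Rule` and `RulesByAgent` types and the `isAllowed` function are hypothetical, and both the fallback to the `*` group and the "Allow wins ties" rule are assumptions borrowed from common robots.txt practice.

```typescript
// One Allow/Disallow line from robots.txt (hypothetical shape).
type Rule = { allow: boolean; path: string };

// Rules grouped by lower-cased user-agent token; "*" is the wildcard group.
type RulesByAgent = Map<string, Rule[]>;

// Returns true (allowed), false (blocked), or undefined (no rule matched).
function isAllowed(
  rules: RulesByAgent,
  crawlerId: string,
  path: string
): boolean | undefined {
  // Assumption: prefer the crawler's own group, else fall back to "*".
  const group = rules.get(crawlerId.toLowerCase()) ?? rules.get("*");
  if (!group) return undefined;

  let best: Rule | undefined;
  for (const rule of group) {
    // An empty path (e.g. "Disallow:") matches nothing.
    if (rule.path === "" || !path.startsWith(rule.path)) continue;

    const longer = best === undefined || rule.path.length > best.path.length;
    // Assumption: on equal length, Allow beats Disallow.
    const tieAllow =
      best !== undefined && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best?.allow;
}
```

With the example `robots.txt` from this document, `GPTBot` resolves to `false` for every path, `PerplexityBot` resolves to `false` only under `/private/`, and a crawler without its own group falls through to the wildcard `Allow: /`.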
## Use Cases [#use-cases]

* **SEO Tools** - Add AI crawler visibility checks to SEO audit dashboards
* **CMS Plugins** - Integrate into WordPress/Drupal to show AI bot status
* **Privacy Compliance** - Monitor which AI platforms can access your content
* **Analytics Dashboards** - Track AI crawler access policies across sites
* **Browser Extensions** - Show AI visibility status for the current page
* **Web Scraping Tools** - Check if your scraper is allowed before crawling

## Limitations [#limitations]

* **Configuration only**: Does not verify actual indexing by AI platforms (no public APIs exist)
* **Static analysis**: Does not execute JavaScript; only analyzes HTML and headers
* **No authentication**: Cannot check auth-protected pages
* **robots.txt compliance**: Assumes crawlers respect robots.txt (not legally enforced)
* **Limited crawlers**: Only checks 14 known AI crawlers (list can be extended)
* **No sitemap parsing**: Only checks individual URLs, not entire sitemaps

## Deployment [#deployment]

### Click the deploy button [#click-the-deploy-button]

[![Deploy to Cloudflare Workers](https://deploy.workers.cloudflare.com/button)](https://deploy.workers.cloudflare.com/?url=https://github.com/shrinathsnayak/cloudflare-experiments/tree/main/apps/experiments/ai-bot-visibility)

### Deploy [#deploy]

Follow the deployment wizard to deploy the Worker to your Cloudflare account. No additional configuration or bindings required.

### Test your deployment [#test-your-deployment]

```bash
curl "https://your-worker.workers.dev/check?url=https://www.cloudflare.com"
```

## Local Development [#local-development]

```bash
cd apps/experiments/ai-bot-visibility
npm install
npm run dev
```

Test locally:

```bash
curl "http://localhost:8787/check?url=https://www.cloudflare.com"
```

## Configuration [#configuration]

No bindings or environment variables are required.
The `wrangler.json` is minimal:

```json
{
  "name": "ai-bot-visibility",
  "main": "src/index.ts",
  "compatibility_date": "2024-01-01"
}
```

### Dependencies [#dependencies]

```json
{
  "dependencies": {
    "hono": "^4.6.12"
  },
  "devDependencies": {
    "@cloudflare/workers-types": "^4.20241127.0",
    "typescript": "^5.7.2",
    "wrangler": "^4"
  }
}
```

## Cloudflare Features Used [#cloudflare-features-used]

* **[Workers](https://developers.cloudflare.com/workers/)** - Serverless execution environment
* **[Fetch API](https://developers.cloudflare.com/workers/runtime-apis/fetch/)** - HTTP client for fetching pages and robots.txt
* **[Edge network](https://developers.cloudflare.com/workers/reference/how-workers-works/)** - Low-latency requests from global edge locations

## Next Steps [#next-steps]

* Learn about robots.txt syntax and best practices
* Documentation for meta robots directives
* Cloudflare's AI crawler blocking feature
* View the complete source code