This is an experimental Worker. Use it as a starting point for your own projects.
The AI Bot Visibility Checker analyzes a webpage’s robots.txt file and page-level directives to determine whether known AI crawlers (like GPTBot, ClaudeBot, PerplexityBot) are allowed or blocked. It provides per-crawler status and a summary by platform.
Disclaimer: This tool reports configuration only. It does not verify whether a URL is actually indexed or used by any AI product, as there are no public APIs for that information.

Features

  • Check 14+ known AI crawler bots in one request
  • Parse robots.txt with user-agent specific rules
  • Detect page-level blocks (meta robots, X-Robots-Tag headers)
  • Support for generic directives (noindex, noai, noimageai)
  • Stateless edge execution (no bindings required)
  • Sub-second response times

API Reference

GET /check

Check AI crawler visibility configuration for a URL.
url
string
required
The target URL to check. Must use http:// or https:// scheme.

Response

url
string
The URL that was analyzed.
disclaimer
string
Reminder that this reports configuration only, not actual indexing status.
crawlers
array
Array of crawler status objects.
summary
object
Aggregated results by status.

Example Request

curl "https://your-worker.workers.dev/check?url=https://www.cloudflare.com"

Example Response

{
  "url": "https://www.cloudflare.com",
  "disclaimer": "Configuration only; we cannot verify actual index inclusion in any AI product.",
  "crawlers": [
    {
      "id": "GPTBot",
      "platform": "ChatGPT",
      "status": "allowed"
    },
    {
      "id": "ChatGPT-User",
      "platform": "ChatGPT",
      "status": "allowed"
    },
    {
      "id": "ClaudeBot",
      "platform": "Claude",
      "status": "not_specified"
    },
    {
      "id": "PerplexityBot",
      "platform": "Perplexity",
      "status": "blocked"
    },
    {
      "id": "Google-Extended",
      "platform": "Google (Gemini/Bard)",
      "status": "not_specified"
    }
    // ... 9 more crawlers
  ],
  "summary": {
    "allowed": ["ChatGPT"],
    "blocked": ["Perplexity"],
    "notSpecified": [
      "Claude",
      "ChatGPT Search",
      "Google (Gemini/Bard)",
      "Google Vertex AI",
      "Common Crawl",
      "ByteDance",
      "Meta AI",
      "Meta",
      "Apple",
      "Amazon",
      "DuckDuckGo"
    ]
  }
}
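The summary object groups crawler results by platform. One plausible aggregation (hypothetical; the repo's buildVisibilityResponse may use different logic) is: a platform is blocked if any of its crawlers is blocked, allowed if any is allowed, and notSpecified otherwise:

```typescript
interface CrawlerResult {
  id: string;
  platform: string;
  status: "allowed" | "blocked" | "not_specified";
}

// Groups per-crawler results into the platform-level summary shown above.
// This is an illustrative sketch, not the repo's exact implementation.
function summarize(crawlers: CrawlerResult[]) {
  const byPlatform = new Map<string, CrawlerResult[]>();
  for (const c of crawlers) {
    const list = byPlatform.get(c.platform) ?? [];
    list.push(c);
    byPlatform.set(c.platform, list);
  }
  const summary = {
    allowed: [] as string[],
    blocked: [] as string[],
    notSpecified: [] as string[],
  };
  for (const [platform, list] of byPlatform) {
    // "blocked" wins over "allowed"; "notSpecified" only if no rule at all.
    if (list.some((c) => c.status === "blocked")) summary.blocked.push(platform);
    else if (list.some((c) => c.status === "allowed")) summary.allowed.push(platform);
    else summary.notSpecified.push(platform);
  }
  return summary;
}
```

Applied to the example above, this would place ChatGPT (both crawlers allowed) under allowed, Perplexity under blocked, and Claude under notSpecified.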

Error Responses

error
string
Human-readable error message.
code
string
Machine-readable error code.
400 Bad Request - Missing or invalid URL:
{
  "error": "Missing or invalid query parameter: url",
  "code": "INVALID_URL"
}
502 Bad Gateway - Failed to fetch the URL:
{
  "error": "Failed to fetch URL",
  "code": "FETCH_ERROR"
}

Monitored AI Crawlers

The worker checks the following AI crawlers:
| Crawler ID | Platform | Description |
| --- | --- | --- |
| GPTBot | ChatGPT | OpenAI’s web crawler |
| ChatGPT-User | ChatGPT | ChatGPT user-facing bot |
| OAI-SearchBot | ChatGPT Search | OpenAI’s search crawler |
| ClaudeBot | Claude | Anthropic’s web crawler |
| PerplexityBot | Perplexity | Perplexity AI’s crawler |
| Google-Extended | Google (Gemini/Bard) | Google’s Gemini crawler |
| Google-CloudVertexBot | Google Vertex AI | Google Vertex AI crawler |
| CCBot | Common Crawl | Common Crawl corpus builder |
| Bytespider | ByteDance | ByteDance’s crawler |
| Meta-ExternalAgent | Meta AI | Meta’s AI crawler |
| FacebookBot | Meta | Facebook’s web crawler |
| Applebot | Apple | Apple’s web crawler |
| Amazonbot | Amazon | Amazon’s web crawler |
| DuckAssistBot | DuckDuckGo | DuckDuckGo AI assistant |
This list is defined in src/constants/crawlers.ts and can be customized for your needs.

Implementation Details

Detection Logic

The worker uses a multi-layered approach to determine crawler visibility:
1. Fetch robots.txt and page content

Both requests are made in parallel:
const [pageResult, robotsBody] = await Promise.all([
  fetchPage(url),
  fetchRobotsTxt(origin),
]);
2. Parse robots.txt rules

Extract per-user-agent Allow/Disallow directives:
const rules = robotsBody ? parseRobotsTxt(robotsBody) : new Map();
The parser handles:
  • User-agent specific rules
  • Wildcard user-agents (*)
  • Path-specific allow/disallow patterns
3. Extract page-level signals

Check HTML meta tags and HTTP headers:
// Detects:
// <meta name="robots" content="noindex, noai">
// <meta name="GPTBot" content="noindex">
// X-Robots-Tag: noindex, nofollow
const signals = getPageRobotSignals(html, headers);
4. Compute per-crawler status

Apply precedence rules for each crawler:
// 1. Page-level block -> blocked
// 2. robots.txt disallow -> blocked
// 3. robots.txt allow -> allowed
// 4. No rule -> not_specified
const status = computeStatus(rules, signals, crawlerId, path);

Core Implementation

Here’s the main route handler from src/routes/check.ts:
app.get("/check", async (c) => {
  const url = validateUrl(c.req.query("url"));
  if (!url)
    return jsonError(c, "Missing or invalid query parameter: url", "INVALID_URL");

  try {
    const parsed = new URL(url);
    const origin = parsed.origin;
    const path = parsed.pathname || "/";

    const [pageResult, robotsBody] = await Promise.all([
      fetchPage(url),
      fetchRobotsTxt(origin),
    ]);

    const rules = robotsBody ? parseRobotsTxt(robotsBody) : new Map();
    const response = buildVisibilityResponse(
      url,
      path,
      rules,
      pageResult.html,
      pageResult.headers
    );
    return jsonSuccess(c, response);
  } catch (e) {
    const message = e instanceof Error ? e.message : "Failed to fetch URL";
    return jsonError(c, message, "FETCH_ERROR", 502);
  }
});
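The validateUrl helper referenced above is not shown in the snippet; a minimal version might look like this (hypothetical sketch, the repo's implementation may differ):

```typescript
// Returns the normalized URL string if it is a valid http(s) URL,
// or null so the caller can respond with INVALID_URL.
// Illustrative only; not necessarily the repo's exact helper.
function validateUrl(raw: string | undefined): string | null {
  if (!raw) return null;
  try {
    const parsed = new URL(raw);
    // Only http:// and https:// targets are accepted, per the /check contract.
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return null;
    return parsed.toString();
  } catch {
    // new URL() throws on malformed input.
    return null;
  }
}
```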

Status Precedence

The visibility logic follows this precedence (from src/lib/visibility.ts):
function computeStatus(
  rules: RobotsRules,
  signals: PageRobotSignals,
  crawlerId: string,
  path: string
): CrawlerStatus {
  // Page-level block takes highest precedence
  if (isCrawlerBlockedByPage(signals, crawlerId)) return "blocked";
  
  // Then check robots.txt
  const robotsAllow = isAllowedByRobots(rules, crawlerId, path);
  if (robotsAllow === false) return "blocked";
  if (robotsAllow === true) return "allowed";
  
  // No rule specified
  return "not_specified";
}

Page-Level Directives

The worker detects these meta tags and headers:
<!-- Generic blocks (apply to all crawlers) -->
<meta name="robots" content="noindex">
<meta name="robots" content="noai">
<meta name="robots" content="noimageai">

<!-- Crawler-specific blocks -->
<meta name="GPTBot" content="noindex">
<meta name="ClaudeBot" content="nofollow, noindex">

<!-- HTTP header equivalent -->
X-Robots-Tag: noindex, noai
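A simplified sketch of how these directives could be detected (illustrative only; the actual getPageRobotSignals may be structured differently and handle more attribute orderings):

```typescript
// Directive tokens treated as page-level blocks.
const BLOCK_TOKENS = ["noindex", "noai", "noimageai"];

// True if a comma-separated content value contains a block token.
function hasBlockToken(content: string): boolean {
  const tokens = content.toLowerCase().split(",").map((t) => t.trim());
  return tokens.some((t) => BLOCK_TOKENS.includes(t));
}

// Checks the X-Robots-Tag header plus generic ("robots") and
// crawler-specific meta tags. The regex assumes name-then-content
// attribute order, a simplification for illustration.
function isBlockedByPage(
  html: string,
  xRobotsTag: string | null,
  crawlerId: string
): boolean {
  if (xRobotsTag && hasBlockToken(xRobotsTag)) return true;
  const metaRe = /<meta\s+name="([^"]+)"\s+content="([^"]+)"/gi;
  let m: RegExpExecArray | null;
  while ((m = metaRe.exec(html)) !== null) {
    const name = m[1].toLowerCase();
    if (name === "robots" || name === crawlerId.toLowerCase()) {
      if (hasBlockToken(m[2])) return true;
    }
  }
  return false;
}
```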

robots.txt Parsing

Example robots.txt rules:
# Block specific crawlers
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /private/

# Allow all others
User-agent: *
Allow: /
The parser extracts per-user-agent rules and applies path matching with longest-prefix-wins logic.
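The longest-prefix-wins matching can be sketched as follows (a simplified illustration of the idea; the actual logic in src/lib/robots.ts may differ):

```typescript
// One user-agent's rule: a path prefix plus whether it allows or disallows.
interface RobotsRule {
  path: string;
  allow: boolean;
}

// Returns true (allowed), false (disallowed), or null (no matching rule).
// When several prefixes match, the longest (most specific) one wins.
function matchPath(rules: RobotsRule[], path: string): boolean | null {
  let best: RobotsRule | null = null;
  for (const rule of rules) {
    if (path.startsWith(rule.path)) {
      if (!best || rule.path.length > best.path.length) best = rule;
    }
  }
  return best ? best.allow : null;
}
```

For the example robots.txt above, PerplexityBot would be disallowed for /private/docs (the /private/ rule is more specific) but allowed elsewhere.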

Setup & Deployment

Prerequisites

  • Node.js and npm installed locally
  • A Cloudflare account (for deployment)

Local Development

1. Clone and install dependencies

git clone https://github.com/shrinathsnayak/cloudflare-experiments
cd cloudflare-experiments/experiments/ai-bot-visibility
npm install
2. Start the development server

npm run dev
This starts Wrangler in dev mode. No bindings required.
3. Test the endpoint

curl "http://localhost:8787/check?url=https://www.cloudflare.com"

Deploy to Production

1. Authenticate with Cloudflare

wrangler login
2. Deploy the Worker

npm run deploy
This publishes your Worker to *.workers.dev or your custom domain.
3. Test the production endpoint

curl "https://ai-bot-visibility.YOUR_SUBDOMAIN.workers.dev/check?url=https://example.com"

One-Click Deploy

Use the Deploy to Cloudflare Workers button in the repository README to deploy this Worker directly to your Cloudflare account. You can fork the repository and update the deploy URL to deploy from your own fork.

Configuration

No bindings or environment variables are required. The wrangler.toml is minimal:
name = "ai-bot-visibility"
main = "src/index.ts"
compatibility_date = "2024-01-01"

Dependencies

{
  "dependencies": {
    "hono": "^4.6.12"
  },
  "devDependencies": {
    "@cloudflare/workers-types": "^4.20241127.0",
    "typescript": "^5.7.2",
    "wrangler": "^4"
  }
}

Cloudflare Features Used

  • Workers - Serverless execution environment
  • Fetch API - HTTP client for fetching pages and robots.txt
  • Edge Networking - Low-latency requests from global edge locations

Use Cases

  • SEO Tools - Add AI crawler visibility checks to SEO audit dashboards
  • CMS Plugins - Integrate into WordPress/Drupal to show AI bot status
  • Privacy Compliance - Monitor which AI platforms can access your content
  • Analytics Dashboards - Track AI crawler access policies across sites
  • Browser Extensions - Show AI visibility status for the current page
  • Web Scraping Tools - Check if your scraper is allowed before crawling

Limitations

  • Configuration only: Does not verify actual indexing by AI platforms (no public APIs exist)
  • Static analysis: Does not execute JavaScript; only analyzes HTML and headers
  • No authentication: Cannot check auth-protected pages
  • robots.txt compliance: Assumes crawlers respect robots.txt (not legally enforced)
  • Limited crawlers: Only checks 14 known AI crawlers (list can be extended)
  • No sitemap parsing: Only checks individual URLs, not entire sitemaps

Advanced Usage

Adding Custom Crawlers

Extend the crawler list in src/constants/crawlers.ts:
export const AI_CRAWLERS: Array<{ id: string; platform: string }> = [
  { id: "GPTBot", platform: "ChatGPT" },
  { id: "ClaudeBot", platform: "Claude" },
  // Add your custom crawlers
  { id: "MyCustomBot", platform: "My AI Platform" },
  // ...
];

Batch Checking Multiple URLs

Create a new endpoint to check multiple URLs (this assumes the single-URL logic from /check has been extracted into a checkVisibility helper):
app.post("/batch-check", async (c) => {
  const { urls } = await c.req.json();
  const results = await Promise.all(
    urls.map(url => checkVisibility(url))
  );
  return c.json({ results });
});

Adding Site-Wide Analysis

Fetch and parse the sitemap to check all pages (parseSitemap is a helper you would implement in src/lib/sitemap.ts):
import { parseSitemap } from './lib/sitemap';

app.get("/site-check", async (c) => {
  const sitemapUrl = c.req.query("sitemap");
  const urls = await parseSitemap(sitemapUrl);
  // Check each URL...
});

FAQ

What does a status of not_specified mean?

If a site doesn’t explicitly allow or block a crawler in robots.txt or meta tags, the status is not_specified. This means the crawler could access the page by default, but there’s no explicit rule.

Does allowed mean my page is actually indexed by AI platforms?

No. allowed means the page configuration doesn’t block the crawler. Actual indexing depends on many factors: crawler behavior, content quality, update frequency, and internal platform decisions.

Can the worker check pages behind authentication?

Not with the default implementation. You’d need to add authentication headers to the fetch requests and handle cookies/sessions.

Which robots.txt features does the parser support?

The parser handles standard robots.txt directives (User-agent, Allow, Disallow) with path prefix matching. It doesn’t support regex patterns or advanced extensions.

What if a crawler ignores robots.txt?

This tool reports configuration only. If a crawler doesn’t respect robots.txt (which is voluntary), you’d need to block it at the network level (firewall, CDN rules, etc.).

Next Steps