AI Bot Visibility Checker

Check whether a URL is configured to allow or block AI crawlers such as GPTBot and ClaudeBot.

This is an experimental Worker. Use it as a starting point for your own projects.

The AI Bot Visibility Checker analyzes a webpage's robots.txt file and page-level directives to determine whether known AI crawlers (like GPTBot, ClaudeBot, PerplexityBot) are allowed or blocked. It provides per-crawler status and a summary by platform.

Disclaimer: This tool reports configuration only. It does not verify whether a URL is actually indexed or used by any AI product, as there are no public APIs for that information.

Features

Check 14 known AI crawlers in one request (the list can be extended)
Parse robots.txt with user-agent specific rules
Detect page-level blocks (meta robots, X-Robots-Tag headers)
Support for generic directives (noindex, noai, noimageai)
Stateless edge execution (no bindings required)
Typically sub-second response times, depending on how quickly the target site responds

API Reference

GET /check

Check AI crawler visibility configuration for a URL.

Response

Field	Type	Description
`url`	string	The URL that was analyzed.
`disclaimer`	string	Reminder that this reports configuration only, not actual indexing status.
`crawlers`	array	Array of crawler status objects.
`summary`	object	Aggregated results by status.

Example Request

curl "https://your-worker.workers.dev/check?url=https://www.cloudflare.com"

Example Response

{
  "url": "https://www.cloudflare.com",
  "disclaimer": "Configuration only; we cannot verify actual index inclusion in any AI product.",
  "crawlers": [
    {
      "id": "GPTBot",
      "platform": "ChatGPT",
      "status": "allowed"
    },
    {
      "id": "ChatGPT-User",
      "platform": "ChatGPT",
      "status": "allowed"
    },
    {
      "id": "ClaudeBot",
      "platform": "Claude",
      "status": "not_specified"
    },
    {
      "id": "PerplexityBot",
      "platform": "Perplexity",
      "status": "blocked"
    },
    {
      "id": "Google-Extended",
      "platform": "Google (Gemini/Bard)",
      "status": "not_specified"
    }
    // ... 9 more crawlers
  ],
  "summary": {
    "allowed": ["ChatGPT"],
    "blocked": ["Perplexity"],
    "notSpecified": [
      "Claude",
      "ChatGPT Search",
      "Google (Gemini/Bard)",
      "Google Vertex AI",
      "Common Crawl",
      "ByteDance",
      "Meta AI",
      "Meta",
      "Apple",
      "Amazon",
      "DuckDuckGo"
    ]
  }
}
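
A client consuming this response may want typed access to it. The following is a minimal sketch whose type names (`CrawlerResult`, `VisibilityResponse`) and helper (`crawlerIdsWithStatus`) are assumptions inferred from the example response, not names taken from the project:

```typescript
// Shapes inferred from the example response; the project's own types may differ.
type CrawlerStatus = "allowed" | "blocked" | "not_specified";

interface CrawlerResult {
  id: string;
  platform: string;
  status: CrawlerStatus;
}

interface VisibilityResponse {
  url: string;
  disclaimer: string;
  crawlers: CrawlerResult[];
  summary: { allowed: string[]; blocked: string[]; notSpecified: string[] };
}

// Hypothetical helper: IDs of all crawlers that have the given status.
function crawlerIdsWithStatus(res: VisibilityResponse, status: CrawlerStatus): string[] {
  return res.crawlers.filter((c) => c.status === status).map((c) => c.id);
}

// A trimmed copy of the example response.
const sample: VisibilityResponse = {
  url: "https://www.cloudflare.com",
  disclaimer: "Configuration only; we cannot verify actual index inclusion in any AI product.",
  crawlers: [
    { id: "GPTBot", platform: "ChatGPT", status: "allowed" },
    { id: "ChatGPT-User", platform: "ChatGPT", status: "allowed" },
    { id: "PerplexityBot", platform: "Perplexity", status: "blocked" },
  ],
  summary: { allowed: ["ChatGPT"], blocked: ["Perplexity"], notSpecified: [] },
};

console.log(crawlerIdsWithStatus(sample, "blocked")); // logs the single blocked ID, PerplexityBot
```

Note that `crawlers` is keyed by crawler ID while `summary` lists platform names, so the two views are not interchangeable.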

Error Responses

Field	Type	Description
`error`	string	Human-readable error message.
`code`	string	Machine-readable error code.

400 Bad Request - Missing or invalid URL:

{
  "error": "Missing or invalid query parameter: url",
  "code": "INVALID_URL"
}

502 Bad Gateway - Failed to fetch the URL:

{
  "error": "Failed to fetch URL",
  "code": "FETCH_ERROR"
}
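
Because both error bodies share the `error` and `code` fields, a client can tell them apart from a success body with a small type guard. This is a sketch only: `isErrorResponse` and `describeFailure` are hypothetical helpers, and the suggested follow-up actions are assumptions rather than documented behavior.

```typescript
interface ErrorResponse {
  error: string;
  code: string;
}

// True when the parsed JSON body has the documented error shape.
function isErrorResponse(body: unknown): body is ErrorResponse {
  if (typeof body !== "object" || body === null) return false;
  const record = body as Record<string, unknown>;
  return typeof record.error === "string" && typeof record.code === "string";
}

// Map the two documented codes to a suggested next step (suggestions are assumptions).
function describeFailure(body: ErrorResponse): string {
  switch (body.code) {
    case "INVALID_URL":
      return "Fix the url query parameter and retry.";
    case "FETCH_ERROR":
      return "The target site could not be fetched; retry later.";
    default:
      return `Unexpected error code ${body.code}: ${body.error}`;
  }
}
```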

Monitored AI Crawlers

The worker checks the following AI crawlers:

Crawler ID	Platform	Description
`GPTBot`	ChatGPT	OpenAI's web crawler
`ChatGPT-User`	ChatGPT	ChatGPT user-facing bot
`OAI-SearchBot`	ChatGPT Search	OpenAI's search crawler
`ClaudeBot`	Claude	Anthropic's web crawler
`PerplexityBot`	Perplexity	Perplexity AI's crawler
`Google-Extended`	Google (Gemini/Bard)	Google's robots.txt control token for Gemini (not a separate crawler)
`Google-CloudVertexBot`	Google Vertex AI	Google Vertex AI crawler
`CCBot`	Common Crawl	Common Crawl corpus builder
`Bytespider`	ByteDance	ByteDance's crawler
`Meta-ExternalAgent`	Meta AI	Meta's AI crawler
`FacebookBot`	Meta	Facebook's web crawler
`Applebot`	Apple	Apple's web crawler
`Amazonbot`	Amazon	Amazon's web crawler
`DuckAssistBot`	DuckDuckGo	DuckDuckGo AI assistant

This list is defined in src/constants/crawlers.ts and can be customized for your needs.

Implementation Details

Detection Logic

The worker uses a multi-layered approach to determine crawler visibility:

Fetch robots.txt and page content

Both requests are made in parallel:

const [pageResult, robotsBody] = await Promise.all([
  fetchPage(url),
  fetchRobotsTxt(origin),
]);
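
`fetchRobotsTxt` is not shown in this section. One plausible shape, sketched below, returns `null` when robots.txt is missing or unreachable so that a site without robots.txt does not fail the whole check. That behavior is an assumption, as are the `FetchLike` type and the injected `fetchImpl` parameter (added here so the function can be exercised without network access).

```typescript
// Minimal structural types so the sketch does not depend on global fetch typings.
type TextResponse = { ok: boolean; text(): Promise<string> };
type FetchLike = (url: string) => Promise<TextResponse>;

// Returns the robots.txt body, or null if it is absent or the request fails.
async function fetchRobotsTxtSketch(origin: string, fetchImpl: FetchLike): Promise<string | null> {
  try {
    // URL.origin has no trailing slash, so plain concatenation is safe here.
    const res = await fetchImpl(`${origin}/robots.txt`);
    return res.ok ? await res.text() : null;
  } catch {
    return null; // network error: treat as "no robots.txt available"
  }
}
```

In a Worker, the global `fetch` could be passed as `fetchImpl`.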

Parse robots.txt rules

Extract per-user-agent Allow/Disallow directives:

const rules = robotsBody ? parseRobotsTxt(robotsBody) : new Map();

The parser handles:

User-agent specific rules
Wildcard user-agents (*)
Path-specific allow/disallow patterns
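
The project's `parseRobotsTxt` is not reproduced here. As an illustration of the three points above, the following sketch groups Allow/Disallow lines under the user-agents that precede them. The `RobotsRule` shape, the lowercased map keys, and the name `parseRobotsTxtSketch` are assumptions.

```typescript
interface RobotsRule {
  type: "allow" | "disallow";
  path: string;
}

// Key: lowercased user-agent token ("gptbot", "*", ...).
type RobotsRulesSketch = Map<string, RobotsRule[]>;

function parseRobotsTxtSketch(body: string): RobotsRulesSketch {
  const rules: RobotsRulesSketch = new Map();
  let agents: string[] = [];   // user-agents of the group currently being read
  let readingAgents = false;   // true while consecutive User-agent lines accumulate

  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (!readingAgents) agents = []; // a rule line closed the previous group
      readingAgents = true;
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!rules.has(agent)) rules.set(agent, []);
    } else if (field === "allow" || field === "disallow") {
      readingAgents = false;
      for (const agent of agents) rules.get(agent)!.push({ type: field, path: value });
    }
    // Other fields (Sitemap, Crawl-delay, ...) are ignored in this sketch.
  }
  return rules;
}
```

Consecutive `User-agent` lines share the rules that follow them, which is why the sketch tracks whether it is still reading agents or has moved on to rules.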

Extract page-level signals

Check HTML meta tags and HTTP headers:

// Detects:
// <meta name="robots" content="noindex, noai">
// <meta name="GPTBot" content="noindex">
// X-Robots-Tag: noindex, nofollow
const signals = getPageRobotSignals(html, headers);
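
How `getPageRobotSignals` collects these signals is not shown. A rough sketch under simplifying assumptions: meta tags are found with a regular expression rather than an HTML parser, attributes must be quoted with `name` before `content`, and user-agent prefixes inside `X-Robots-Tag` are not handled. `PageSignalsSketch` and both function names are hypothetical.

```typescript
// Hypothetical shape; the project's PageRobotSignals type is not shown.
interface PageSignalsSketch {
  meta: Map<string, string[]>; // lowercased meta name -> lowercased directives
  header: string[];            // directives from the X-Robots-Tag header
}

// Split "noindex, NoAI" into ["noindex", "noai"].
function splitDirectives(value: string): string[] {
  return value
    .split(",")
    .map((d) => d.trim().toLowerCase())
    .filter((d) => d.length > 0);
}

function getPageRobotSignalsSketch(
  html: string,
  headers: { get(name: string): string | null },
): PageSignalsSketch {
  const meta = new Map<string, string[]>();
  // Collects every name/content meta tag; callers look up only the names they need.
  const metaTag = /<meta\s+name=["']([^"']+)["']\s+content=["']([^"']*)["'][^>]*>/gi;
  let match: RegExpExecArray | null;
  while ((match = metaTag.exec(html)) !== null) {
    const name = match[1].toLowerCase();
    const previous = meta.get(name) ?? [];
    meta.set(name, previous.concat(splitDirectives(match[2])));
  }
  return { meta, header: splitDirectives(headers.get("x-robots-tag") ?? "") };
}
```

The `headers` parameter only needs a `get` method, so a standard `Headers` object from a fetch response fits.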

Compute per-crawler status

Apply precedence rules for each crawler:

// 1. Page-level block -> blocked
// 2. robots.txt disallow -> blocked
// 3. robots.txt allow -> allowed
// 4. No rule -> not_specified
const status = computeStatus(rules, signals, crawlerId, path);

Core Implementation

Here's the main route handler from src/routes/check.ts:

app.get("/check", async (c) => {
  const url = validateUrl(c.req.query("url"));
  if (!url) return jsonError(c, "Missing or invalid query parameter: url", "INVALID_URL");

  try {
    const parsed = new URL(url);
    const origin = parsed.origin;
    const path = parsed.pathname || "/";

    const [pageResult, robotsBody] = await Promise.all([fetchPage(url), fetchRobotsTxt(origin)]);

    const rules = robotsBody ? parseRobotsTxt(robotsBody) : new Map();
    const response = buildVisibilityResponse(url, path, rules, pageResult.html, pageResult.headers);
    return jsonSuccess(c, response);
  } catch (e) {
    const message = e instanceof Error ? e.message : "Failed to fetch URL";
    return jsonError(c, message, "FETCH_ERROR", 502);
  }
});
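
The handler relies on `validateUrl`, which is not shown. A minimal sketch of what such a validator might do, assuming it accepts only absolute `http:` and `https:` URLs and returns `null` otherwise (the name `validateUrlSketch` and this policy are assumptions):

```typescript
// Returns the input unchanged when it is an absolute http(s) URL, otherwise null.
function validateUrlSketch(input: string | undefined): string | null {
  if (!input) return null;
  try {
    const { protocol } = new URL(input);
    return protocol === "http:" || protocol === "https:" ? input : null;
  } catch {
    return null; // the URL constructor throws on unparseable input
  }
}
```

Rejecting other schemes up front keeps inputs such as `file:` or `ftp:` URLs from ever reaching the fetch calls.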

Status Precedence

The visibility logic follows this precedence (from src/lib/visibility.ts):

function computeStatus(
  rules: RobotsRules,
  signals: PageRobotSignals,
  crawlerId: string,
  path: string
): CrawlerStatus {
  // Page-level block takes highest precedence
  if (isCrawlerBlockedByPage(signals, crawlerId)) return "blocked";

  // Then check robots.txt
  const robotsAllow = isAllowedByRobots(rules, crawlerId, path);
  if (robotsAllow === false) return "blocked";
  if (robotsAllow === true) return "allowed";

  // No rule specified
  return "not_specified";
}
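
To make the precedence concrete, the same decision can be written over the two intermediate results. This sketch assumes that `isAllowedByRobots` returns `null` when no rule applies, which the strict comparisons above suggest but do not confirm; `resolveStatus` is a hypothetical name.

```typescript
type CrawlerStatus = "allowed" | "blocked" | "not_specified";

// pageBlocked: outcome of the page-level check.
// robotsAllow: true or false when robots.txt has a matching rule, null when it has none.
function resolveStatus(pageBlocked: boolean, robotsAllow: boolean | null): CrawlerStatus {
  if (pageBlocked) return "blocked";
  if (robotsAllow === false) return "blocked";
  if (robotsAllow === true) return "allowed";
  return "not_specified";
}

// A page-level block wins even when robots.txt explicitly allows the crawler.
console.log(resolveStatus(true, true)); // blocked
console.log(resolveStatus(false, null)); // not_specified
```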

Page-Level Directives

The worker detects these meta tags and headers:

<!-- Generic blocks (apply to all crawlers) -->
<meta name="robots" content="noindex" />
<meta name="robots" content="noai" />
<meta name="robots" content="noimageai" />

<!-- Crawler-specific blocks -->
<meta name="GPTBot" content="noindex" />
<meta name="ClaudeBot" content="nofollow, noindex" />

<!-- HTTP header equivalent -->
X-Robots-Tag: noindex, noai
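
One way to turn these directives into a block decision is sketched below. It assumes that only `noindex`, `noai`, and `noimageai` count as blocking (so `nofollow` alone does not), and that signals arrive as already-lowercased directive lists; the `PageSignalsSketch` shape and the function name are hypothetical.

```typescript
// Hypothetical shape for already-extracted signals.
interface PageSignalsSketch {
  meta: Map<string, string[]>; // lowercased meta name -> lowercased directives
  header: string[];            // directives from the X-Robots-Tag header
}

// Directives treated as blocking (an assumption based on the list above).
const BLOCKING_DIRECTIVES = new Set(["noindex", "noai", "noimageai"]);

function isCrawlerBlockedByPageSketch(signals: PageSignalsSketch, crawlerId: string): boolean {
  // Generic signals apply to every crawler.
  const generic = (signals.meta.get("robots") ?? []).concat(signals.header);
  // Crawler-specific meta tags apply only to the named crawler.
  const specific = signals.meta.get(crawlerId.toLowerCase()) ?? [];
  return generic.concat(specific).some((d) => BLOCKING_DIRECTIVES.has(d));
}
```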

robots.txt Parsing

Example robots.txt rules:

# Block specific crawlers
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /private/

# Allow all others
User-agent: *
Allow: /

The parser extracts per-user-agent rules and applies path matching with longest-prefix-wins logic.
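
Longest-prefix-wins can be sketched as follows. Several choices here are assumptions rather than the project's confirmed behavior: a crawler's own group replaces the wildcard group instead of merging with it, ties between equally long patterns go to Allow, a group with no matching rule yields `null`, and `*` or `$` inside patterns is not interpreted.

```typescript
interface RobotsRule {
  type: "allow" | "disallow";
  path: string;
}

// Returns true/false when a rule matches, or null when nothing applies.
function isAllowedByRobotsSketch(
  rules: Map<string, RobotsRule[]>,
  crawlerId: string,
  path: string,
): boolean | null {
  // Prefer the crawler's own group; fall back to the wildcard group.
  const group = rules.get(crawlerId.toLowerCase()) ?? rules.get("*");
  if (!group) return null;

  let best: RobotsRule | null = null;
  for (const rule of group) {
    // An empty pattern (for example "Disallow:") matches nothing.
    if (rule.path === "" || !path.startsWith(rule.path)) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tieGoesToAllow =
      best !== null && rule.path.length === best.path.length && rule.type === "allow";
    if (longer || tieGoesToAllow) best = rule;
  }
  return best === null ? null : best.type === "allow";
}
```

With the example rules above, `GPTBot` is blocked everywhere, `PerplexityBot` is blocked only under `/private/`, and a crawler without its own group falls through to the wildcard `Allow: /`.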

Advanced Usage

Adding Custom Crawlers

Extend the crawler list in src/constants/crawlers.ts:

export const AI_CRAWLERS: Array<{ id: string; platform: string }> = [
  { id: "GPTBot", platform: "ChatGPT" },
  { id: "ClaudeBot", platform: "Claude" },
  // Add your custom crawlers
  { id: "MyCustomBot", platform: "My AI Platform" },
  // ...
];

Batch Checking Multiple URLs

Create a new endpoint to check multiple URLs:

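app.post("/batch-check", async (c) => {
  const { urls } = await c.req.json<{ urls?: unknown }>();
  if (!Array.isArray(urls)) return jsonError(c, "Expected a JSON body with a urls array", "INVALID_URL");
  // checkVisibility is not defined in this project; it stands for the per-URL logic of /check.
  const results = await Promise.all(urls.map((url) => checkVisibility(url)));
  return c.json({ results });
});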
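
Each checked URL costs two outbound requests (the page and its robots.txt), and Workers limit the number of subrequests per invocation, so a batch endpoint likely needs to validate and cap its input. A possible pre-check is sketched below; `normalizeBatchUrls` and the default cap of 10 are illustrative assumptions, not part of the project.

```typescript
// Returns a de-duplicated list of http(s) URLs, or null if the input is unusable.
function normalizeBatchUrls(input: unknown, maxUrls = 10): string[] | null {
  if (!Array.isArray(input) || input.length === 0) return null;
  const unique = new Set<string>();
  for (const item of input) {
    if (typeof item !== "string") return null;
    try {
      const { protocol } = new URL(item);
      if (protocol !== "http:" && protocol !== "https:") return null;
    } catch {
      return null; // not a parseable absolute URL
    }
    unique.add(item);
  }
  // Reject oversized batches instead of silently truncating them.
  return unique.size <= maxUrls ? Array.from(unique) : null;
}
```

Rejecting the whole batch on the first bad entry keeps the response shape simple; returning per-URL errors would be an equally reasonable design.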

Adding Site-Wide Analysis

Fetch and parse the sitemap to check all pages:

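// parseSitemap is a hypothetical module; this project does not ship sitemap parsing.
import { parseSitemap } from "./lib/sitemap";

app.get("/site-check", async (c) => {
  const sitemapUrl = c.req.query("sitemap");
  if (!sitemapUrl) return jsonError(c, "Missing query parameter: sitemap", "INVALID_URL");
  const urls = await parseSitemap(sitemapUrl);
  // Check each URL...
});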
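
Since `parseSitemap` would have to be written, here is a sketch of its extraction step: pulling `<loc>` values out of sitemap XML that has already been fetched. It assumes a plain sitemap without CDATA sections, uses a regular expression instead of an XML parser, and does not follow sitemap index files; `extractSitemapUrls` is a hypothetical name.

```typescript
// Extracts the contents of every <loc> element from sitemap XML.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const loc = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = loc.exec(xml)) !== null) {
    // Sitemaps entity-escape ampersands in URLs; undo that common case.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}
```

In a sitemap index file the `<loc>` values point at other sitemaps rather than pages, so a fuller implementation would need to distinguish the two.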

FAQ

Use Cases

SEO Tools - Add AI crawler visibility checks to SEO audit dashboards
CMS Plugins - Integrate into WordPress/Drupal to show AI bot status
Privacy Compliance - Monitor which AI platforms can access your content
Analytics Dashboards - Track AI crawler access policies across sites
Browser Extensions - Show AI visibility status for the current page
Web Scraping Tools - Check if your scraper is allowed before crawling

Limitations

Configuration only: Does not verify actual indexing by AI platforms (no public APIs exist)
Static analysis: Does not execute JavaScript; only analyzes HTML and headers
No authentication: Cannot check auth-protected pages
robots.txt compliance: Assumes crawlers respect robots.txt (not legally enforced)
Limited crawlers: Only checks 14 known AI crawlers (list can be extended)
No sitemap parsing: Only checks individual URLs, not entire sitemaps

Deployment

Click the deploy button

Follow the deployment wizard to deploy the Worker to your Cloudflare account. No additional configuration or bindings required.

Test your deployment

curl "https://your-worker.workers.dev/check?url=https://www.cloudflare.com"

Local Development

cd apps/experiments/ai-bot-visibility
npm install
npm run dev

Test locally:

curl "http://localhost:8787/check?url=https://www.cloudflare.com"

Configuration

No bindings or environment variables are required. The wrangler.json is minimal:

{
  "name": "ai-bot-visibility",
  "main": "src/index.ts",
  "compatibility_date": "2024-01-01"
}

Dependencies

{
  "dependencies": {
    "hono": "^4.6.12"
  },
  "devDependencies": {
    "@cloudflare/workers-types": "^4.20241127.0",
    "typescript": "^5.7.2",
    "wrangler": "^4"
  }
}

Cloudflare Features Used

Workers - Serverless execution environment
Fetch API - HTTP client for fetching pages and robots.txt
Edge network - Low-latency requests from global edge locations
