Readability Extractor

Extract clean article content from URLs using Browser Rendering and readability heuristics

The Worker loads a fully rendered page with Browser Rendering, then strips navigation, ads, and sidebars using readability-style heuristics. It returns the title, author (when detectable), body text, word count, and estimated read time.
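To make "readability-style heuristics" more concrete, here is a minimal sketch of one common idea: favor blocks with a lot of text and penalize blocks whose text is mostly links. This is not the experiment's actual code and not Mozilla Readability's algorithm. The Candidate shape, the function names, and the scoring formula are all assumptions chosen for illustration.

```typescript
// Hypothetical, simplified model of a candidate container from the rendered DOM.
interface Candidate {
  selector: string;       // where the block came from, e.g. "article" or "nav"
  textLength: number;     // characters of visible text in the block
  linkTextLength: number; // characters of that text that sit inside links
}

// Share of a block's text that is link text; navigation and sidebars tend to score high.
function linkDensity(c: Candidate): number {
  return c.textLength === 0 ? 0 : c.linkTextLength / c.textLength;
}

// Assumed scoring rule: reward long text, penalize link-heavy blocks.
function scoreCandidate(c: Candidate): number {
  return c.textLength * (1 - linkDensity(c));
}

// Return the highest-scoring candidate, or null when there is nothing to score.
function pickMainContent(candidates: Candidate[]): Candidate | null {
  let top: Candidate | null = null;
  for (const c of candidates) {
    if (top === null || scoreCandidate(c) > scoreCandidate(top)) {
      top = c;
    }
  }
  return top;
}

// Usage with made-up numbers: the long, link-light block wins.
const mainBlock = pickMainContent([
  { selector: "nav", textLength: 300, linkTextLength: 280 },
  { selector: "article", textLength: 4000, linkTextLength: 200 },
  { selector: "aside", textLength: 600, linkTextLength: 450 },
]);
console.log(mainBlock?.selector); // prints "article"
```

A real extractor would gather these measurements from the rendered DOM and would usually combine several signals (tag names, class names, paragraph counts) rather than link density alone.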

Features

GET /extract - rendered DOM extraction with readability heuristics
Returns title, author, body, wordCount, readTimeMinutes
Uses @cloudflare/puppeteer like other Browser Rendering experiments
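The wordCount and readTimeMinutes fields can be derived from the extracted body text. The sketch below shows one plausible way to compute them. The field types in ExtractResult, the helper name, and the 200 words-per-minute reading speed are assumptions, since this page does not state how the experiment calculates read time.

```typescript
// Assumed response shape: field names come from this page, the types are guesses.
interface ExtractResult {
  title: string;
  author: string | null; // null when no author could be detected (assumption)
  body: string;
  wordCount: number;
  readTimeMinutes: number;
}

// Assumed average reading speed; the experiment may use a different value.
const ASSUMED_WORDS_PER_MINUTE = 200;

// Count whitespace-separated words and round the read time up to whole minutes.
function readingStats(
  body: string,
  wordsPerMinute: number = ASSUMED_WORDS_PER_MINUTE,
): Pick<ExtractResult, "wordCount" | "readTimeMinutes"> {
  const trimmed = body.trim();
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  const readTimeMinutes = wordCount === 0 ? 0 : Math.ceil(wordCount / wordsPerMinute);
  return { wordCount, readTimeMinutes };
}
```

Rounding up means any non-empty article reports at least one minute, which avoids showing "0 min read" for short posts.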

API Reference

GET /extract

url string (required) - Article URL (http or https).
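The http-or-https requirement can be checked with the standard URL class before any browser session is started. This is a minimal sketch of such a check, not the experiment's actual validation; the function name parseArticleUrl is hypothetical.

```typescript
// Return a parsed URL when the input is an absolute http(s) URL, otherwise null.
function parseArticleUrl(raw: string | null): URL | null {
  if (!raw) {
    return null; // parameter missing or empty
  }
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return null; // not an absolute URL
  }
  // Reject schemes such as ftp:, file:, or javascript:.
  return parsed.protocol === "http:" || parsed.protocol === "https:" ? parsed : null;
}
```

A handler could call this on the url query parameter and answer with 400 INVALID_URL whenever it returns null.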

Example Request

curl "https://your-worker.workers.dev/extract?url=https://example.com/article"
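When the article URL has its own query string, it needs to be percent-encoded so that its parameters are not read as parameters of /extract. The sketch below builds the request URL with the standard URL and URLSearchParams APIs. The helper name buildExtractRequest is hypothetical, and the worker origin is the placeholder from the curl example.

```typescript
// Build a GET /extract request URL with the article URL safely encoded.
function buildExtractRequest(workerOrigin: string, articleUrl: string): string {
  const endpoint = new URL("/extract", workerOrigin);
  // searchParams.set percent-encodes characters such as "&", "?", and "=".
  endpoint.searchParams.set("url", articleUrl);
  return endpoint.toString();
}

const requestUrl = buildExtractRequest(
  "https://your-worker.workers.dev",
  "https://example.com/article?id=1&ref=home",
);
console.log(requestUrl);
```

The resulting string can be passed to fetch or to curl unchanged.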

Error Codes

400 - INVALID_URL
502 - EXTRACT_ERROR
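A client can map these statuses back to the documented codes and decide whether a retry makes sense. The following is a sketch under assumptions: the helper names are hypothetical, and treating 502 as retryable is a guess about typical rendering failures, not documented behavior of this experiment.

```typescript
type ExtractErrorCode = "INVALID_URL" | "EXTRACT_ERROR";

// Map an HTTP status to the error code listed above, or null for any other status.
function errorCodeForStatus(httpStatus: number): ExtractErrorCode | null {
  switch (httpStatus) {
    case 400:
      return "INVALID_URL";
    case 502:
      return "EXTRACT_ERROR";
    default:
      return null;
  }
}

// Assumption: a failed extraction may be transient, while an invalid URL
// will fail the same way on every attempt.
function shouldRetry(httpStatus: number): boolean {
  return errorCodeForStatus(httpStatus) === "EXTRACT_ERROR";
}
```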

Use Cases

Build reading-mode or newsletter digest pipelines
Extract main content from JavaScript-heavy news sites
Prototype RAG document ingestion from article URLs

Limitations

Requires Browser Rendering to be enabled on your Cloudflare account
Heuristic extraction; not identical to Mozilla Readability
Local development may need wrangler dev --remote for the browser binding to work

Deployment

Click the deploy button

Configure bindings

Declare the Browser Rendering binding BROWSER and the nodejs_compat_v2 compatibility flag in wrangler.json.

Test your deployment

See the experiment README for curl examples.

Local Development

cd apps/experiments/readability-extractor
npm install
npm run dev

Configuration

The Worker expects the Browser Rendering binding BROWSER and the nodejs_compat_v2 compatibility flag in wrangler.json.

Cloudflare Features Used

Browser Rendering
Workers
