Readability Extractor
Extract clean article content from URLs using Browser Rendering and readability heuristics
Load a fully rendered page with Browser Rendering, then strip navigation, ads, and sidebars using readability-style heuristics. Returns title, author (when detectable), body text, word count, and estimated read time.
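The source does not show the heuristics themselves, so the following is only a minimal sketch of one common readability-style approach: score each candidate block by text length, penalise link-heavy or navigation-like blocks, and keep the highest scorer. The Block shape, the hint patterns, and the weights are all assumptions made for illustration; a real extractor would walk the rendered DOM instead of a flat array.

```typescript
// Hypothetical simplified view of a DOM block (not the experiment's actual types).
interface Block {
  tag: string;        // element name, e.g. "article", "nav", "div"
  classNames: string; // class and id attributes joined by spaces
  text: string;       // all visible text in the block
  linkText: string;   // the part of that text that sits inside <a> elements
}

// Assumed hint lists; tune them for the sites you care about.
const NEGATIVE_HINT = /\b(nav|menu|sidebar|footer|ads?|promo|banner|comments?|share)\b/i;
const POSITIVE_HINT = /\b(article|content|post|story|entry|main)\b/i;

function scoreBlock(block: Block): number {
  const textLength = block.text.trim().length;
  if (textLength === 0) return 0;

  // Blocks that are mostly links (menus, related-article lists) score low.
  const linkDensity = Math.min(1, block.linkText.length / textLength);
  let score = textLength * (1 - linkDensity);

  // Structural and naming hints; the multipliers are arbitrary illustrative weights.
  if (block.tag === "nav" || block.tag === "aside" || block.tag === "footer") score *= 0.1;
  if (block.tag === "article" || block.tag === "main") score *= 1.5;
  if (NEGATIVE_HINT.test(block.classNames)) score *= 0.2;
  if (POSITIVE_HINT.test(block.classNames)) score *= 1.5;
  return score;
}

// Returns the highest-scoring block, or undefined when nothing scores above zero.
function pickMainBlock(blocks: Block[]): Block | undefined {
  let best: Block | undefined;
  let bestScore = 0;
  for (const block of blocks) {
    const score = scoreBlock(block);
    if (score > bestScore) {
      best = block;
      bestScore = score;
    }
  }
  return best;
}
```

Multiplying by (1 - linkDensity) is what lets a long navigation menu lose to a shorter block of prose: a menu whose text is entirely links scores zero regardless of length.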
Features
- GET /extract - rendered DOM extraction with readability heuristics
- Returns title, author, body, wordCount, readTimeMinutes
- Uses @cloudflare/puppeteer like other Browser Rendering experiments
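As a rough illustration of how wordCount and readTimeMinutes could be derived from the extracted body, here is a minimal sketch. The 200 words-per-minute reading speed, the round-up rule, and the helper names are assumptions; the experiment may use different values.

```typescript
// Assumed average reading speed; not taken from the experiment's source.
const WORDS_PER_MINUTE = 200;

// Counts whitespace-separated tokens in the extracted body text.
function countWords(body: string): number {
  const trimmed = body.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Rounds up to whole minutes, with a floor of 1 minute for any non-empty article.
function estimateReadTimeMinutes(wordCount: number): number {
  if (wordCount === 0) return 0;
  return Math.max(1, Math.ceil(wordCount / WORDS_PER_MINUTE));
}
```

Under these assumptions a 450-word article yields estimateReadTimeMinutes(450) === 3, since 450 / 200 = 2.25 rounds up to 3.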
API Reference
GET /extract
url (string, required) - Article URL (http or https).
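The http-or-https requirement on url can be checked with the standard URL constructor. The sketch below shows one way to do it; parseArticleUrl is a hypothetical helper name, and the experiment's actual validation may differ.

```typescript
// Returns a parsed URL when `raw` is an absolute http(s) URL, otherwise null.
// Hypothetical helper for illustration only.
function parseArticleUrl(raw: string | null): URL | null {
  if (!raw) return null;
  let parsed: URL;
  try {
    parsed = new URL(raw); // throws on relative or malformed input
  } catch {
    return null;
  }
  // URL.protocol includes the trailing colon.
  return parsed.protocol === "http:" || parsed.protocol === "https:" ? parsed : null;
}
```

A handler could call parseArticleUrl(new URL(request.url).searchParams.get("url")) and reject the request when the result is null.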
Example Request
curl "https://your-worker.workers.dev/extract?url=https://example.com/article"
Error Codes
400 - INVALID_URL
502 - EXTRACT_ERROR
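When the article URL carries its own query string, it needs to be percent-encoded before being passed as the url parameter, otherwise its & and = characters are read as part of the Worker's query. The sketch below builds the request URL with URLSearchParams and maps the two documented status codes to their names. The helper names are hypothetical, and the mapping reflects only the codes listed above, not the shape of the error response body.

```typescript
// Status codes documented for GET /extract.
const EXTRACT_ERROR_CODES: Record<number, string> = {
  400: "INVALID_URL",
  502: "EXTRACT_ERROR",
};

// Builds the /extract request URL; searchParams.set percent-encodes the article URL.
function buildExtractRequest(workerOrigin: string, articleUrl: string): string {
  const endpoint = new URL("/extract", workerOrigin);
  endpoint.searchParams.set("url", articleUrl);
  return endpoint.toString();
}

// Returns the documented error name for a status, or undefined for any other status.
function errorCodeFor(status: number): string | undefined {
  return EXTRACT_ERROR_CODES[status];
}
```

A client might call fetch(buildExtractRequest("https://your-worker.workers.dev", articleUrl)) and then use errorCodeFor(response.status) to decide whether to prompt for a corrected URL (400) or to retry later (502).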
Use Cases
- Build reading-mode or newsletter digest pipelines
- Extract main content from JavaScript-heavy news sites
- Prototype RAG document ingestion from article URLs
Limitations
- Requires Browser Rendering on your account
- Heuristic extraction; not identical to Mozilla Readability
- Local dev may need wrangler dev --remote for the browser binding
Deployment
Configure bindings
Declare a Browser binding named BROWSER and enable the nodejs_compat_v2 compatibility flag in wrangler.json.
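A wrangler.json fragment along these lines would declare both settings. It is a partial sketch that shows only the two keys named above; the other fields a Worker needs (such as name, main, and compatibility_date) are omitted and should come from the experiment's own configuration.

```json
{
  "compatibility_flags": ["nodejs_compat_v2"],
  "browser": { "binding": "BROWSER" }
}
```

The binding name determines how the Worker reaches the browser at runtime: with the name BROWSER it is exposed as env.BROWSER.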
Test your deployment
See the experiment README for curl examples.
Local Development
cd apps/experiments/readability-extractor
npm install
npm run dev
Configuration
Declare a Browser binding named BROWSER and enable the nodejs_compat_v2 compatibility flag in wrangler.json.