# riddle_crawl
Crawl a site and extract content from each page into a downloadable dataset.
## Overview
riddle_crawl combines mapping and scraping into a single operation. It crawls a website, extracts structured content from every page, and returns the results as a downloadable dataset.
## Output formats

Choose from `jsonl`, `json`, `csv`, or `zip`. The dataset is returned as a downloadable artifact.
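Because an unsupported format value will fail the request, it can be worth guarding the value client-side. A minimal sketch, assuming only the four formats documented above (the helper name is ours, not part of the API):

```javascript
// Output formats documented for riddle_crawl (hypothetical client-side guard).
const CRAWL_FORMATS = ["jsonl", "json", "csv", "zip"];

function assertCrawlFormat(format) {
  if (!CRAWL_FORMATS.includes(format)) {
    throw new Error(`Unsupported format "${format}"; expected one of: ${CRAWL_FORMATS.join(", ")}`);
  }
  return format;
}
```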
## Endpoint

`POST /v1/crawl`

```bash
curl -X POST "https://api.riddledc.com/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_pages": 100, "format": "jsonl"}'
```

## Parameters
| Parameter | Type | Description |
|---|---|---|
| `url` | string | Required. Starting URL to crawl. |
| `max_pages` | number | Maximum pages to crawl. Default: `100` |
| `format` | string | Output format: `jsonl` (default), `json`, `csv`, or `zip` |
| `include_patterns` | string[] | Only crawl URLs matching these glob patterns. |
| `exclude_patterns` | string[] | Skip URLs matching these glob patterns. |
| `js_rendering` | boolean | Render JavaScript on each page. Default: `true` |
| `respect_robots` | boolean | Honor robots.txt directives. Default: `true` |
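The table above maps directly onto the JSON request body. The sketch below builds a body while filling in the documented defaults; the `buildCrawlRequest` helper is our illustration, not part of any official SDK:

```javascript
// Build a riddle_crawl request body, applying the documented defaults.
// Only `url` is required; optional pattern arrays are omitted when unset.
function buildCrawlRequest({
  url,
  max_pages = 100,
  format = "jsonl",
  include_patterns,
  exclude_patterns,
  js_rendering = true,
  respect_robots = true
} = {}) {
  if (!url) throw new Error("url is required");
  const body = { url, max_pages, format, js_rendering, respect_robots };
  if (include_patterns) body.include_patterns = include_patterns;
  if (exclude_patterns) body.exclude_patterns = exclude_patterns;
  return body;
}
```

For example, `buildCrawlRequest({ url: "https://example.com", format: "csv" })` produces a body with `max_pages: 100` and both boolean flags set to `true`.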
## Response
The response includes dataset metadata. The full dataset is available as a downloadable artifact.
```json
{
  "id": "crawl_a1b2c3d4",
  "status": "complete",
  "pages_crawled": 47,
  "format": "jsonl",
  "artifact_url": "https://api.riddledc.com/v1/artifacts/crawl_a1b2c3d4",
  "bytes": 284531
}
```

## Dataset Row (JSONL)
Each line in the JSONL output contains one page:
```json
{
  "url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn about our company.",
  "markdown": "# About Us\n\nWe are...",
  "word_count": 312,
  "links": [...],
  "headings": [...]
}
```

## Examples
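A downloaded JSONL dataset is one JSON object per line, so it can be consumed with a few lines of plain JavaScript. A minimal parsing sketch; the sample rows are illustrative:

```javascript
// Parse a JSONL dataset: one page object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line));
}

// Illustrative two-row dataset.
const sample =
  '{"url":"https://example.com/","word_count":120}\n' +
  '{"url":"https://example.com/about","word_count":312}\n';
const pages = parseJsonl(sample);
console.log(pages.length, pages[1].url); // prints: 2 https://example.com/about
```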
### JavaScript
```javascript
const response = await fetch("https://api.riddledc.com/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${RIDDLE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 100,
    format: "jsonl"
  })
});

const { artifact_url, pages_crawled } = await response.json();
console.log(`Crawled ${pages_crawled} pages`);

// Download the dataset
const dataset = await fetch(artifact_url, {
  headers: { "Authorization": `Bearer ${RIDDLE_API_KEY}` }
}).then(r => r.text());
```

### Filtered Crawl with CSV Output
```javascript
const response = await fetch("https://api.riddledc.com/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${RIDDLE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 50,
    format: "csv",
    include_patterns: ["/blog/*"],
    exclude_patterns: ["/blog/drafts/*"]
  })
}).then(r => r.json());
```

## Use Cases
### Training Data
Crawl documentation sites or knowledge bases to build datasets for fine-tuning LLMs.
### Content Migration
Extract all content from an existing site as structured data for import into a new CMS.
### Competitive Analysis
Crawl competitor sites to analyze content strategy, page structure, and keyword coverage.
### Archival
Create offline backups of entire websites as structured datasets.