# riddle_crawl
Crawl a site and extract content from each page into a downloadable dataset.
## Overview
riddle_crawl combines mapping and scraping into a single operation. It crawls a website, extracts structured content from every page, and returns the results as a downloadable dataset.
## Output formats

Choose from `jsonl`, `json`, `csv`, or `zip`. The dataset is returned as a downloadable artifact.
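Because an unsupported format value will fail the request, it can be worth guarding the value client-side. A minimal sketch, assuming only the four formats documented above (the helper name is ours, not part of the API):

```javascript
// Output formats documented for riddle_crawl (hypothetical client-side guard).
const CRAWL_FORMATS = ["jsonl", "json", "csv", "zip"];

function assertCrawlFormat(format) {
  if (!CRAWL_FORMATS.includes(format)) {
    throw new Error(`Unsupported format "${format}"; expected one of: ${CRAWL_FORMATS.join(", ")}`);
  }
  return format;
}
```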
## Endpoint

`POST /v1/crawl`

```bash
curl -X POST "https://api.riddledc.com/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_pages": 100, "format": "jsonl"}'
```

## Parameters
| Parameter | Type | Description |
|---|---|---|
| `url` | string | Required. Starting URL to crawl. |
| `max_pages` | number | Maximum pages to crawl. Default: `100` |
| `format` | string | Output format: `jsonl` (default), `json`, `csv`, or `zip` |
| `include_patterns` | string[] | Only crawl URLs matching these glob patterns. |
| `exclude_patterns` | string[] | Skip URLs matching these glob patterns. |
| `js_rendering` | boolean | Render JavaScript on each page. Default: `true` |
| `respect_robots` | boolean | Honor robots.txt directives. Default: `true` |
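The table above maps directly onto the JSON request body. The sketch below builds a body while filling in the documented defaults; the `buildCrawlRequest` helper is our illustration, not part of any official SDK:

```javascript
// Build a riddle_crawl request body, applying the documented defaults.
// Only `url` is required; optional pattern arrays are omitted when unset.
function buildCrawlRequest({
  url,
  max_pages = 100,
  format = "jsonl",
  include_patterns,
  exclude_patterns,
  js_rendering = true,
  respect_robots = true
} = {}) {
  if (!url) throw new Error("url is required");
  const body = { url, max_pages, format, js_rendering, respect_robots };
  if (include_patterns) body.include_patterns = include_patterns;
  if (exclude_patterns) body.exclude_patterns = exclude_patterns;
  return body;
}
```

For example, `buildCrawlRequest({ url: "https://example.com", format: "csv" })` produces a body with `max_pages: 100` and both boolean flags set to `true`.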
## Response
The response includes dataset metadata. The full dataset is available as a downloadable artifact.
```json
{
  "id": "crawl_a1b2c3d4",
  "status": "complete",
  "pages_crawled": 47,
  "format": "jsonl",
  "artifact_url": "https://api.riddledc.com/v1/artifacts/crawl_a1b2c3d4",
  "bytes": 284531
}
```

## Dataset Row (JSONL)
Each line in the JSONL output contains one page:
```json
{
  "url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn about our company.",
  "markdown": "# About Us\n\nWe are...",
  "word_count": 312,
  "links": [...],
  "headings": [...]
}
```

## Examples
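A downloaded JSONL dataset is one JSON object per line, so it can be consumed with a few lines of plain JavaScript. A minimal parsing sketch; the sample rows are illustrative:

```javascript
// Parse a JSONL dataset: one page object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line));
}

// Illustrative two-row dataset.
const sample =
  '{"url":"https://example.com/","word_count":120}\n' +
  '{"url":"https://example.com/about","word_count":312}\n';
const pages = parseJsonl(sample);
console.log(pages.length, pages[1].url); // prints: 2 https://example.com/about
```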
### JavaScript
```javascript
const response = await fetch("https://api.riddledc.com/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${RIDDLE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 100,
    format: "jsonl"
  })
});

const { artifact_url, pages_crawled } = await response.json();
console.log(`Crawled ${pages_crawled} pages`);

// Download the dataset
const dataset = await fetch(artifact_url, {
  headers: { "Authorization": `Bearer ${RIDDLE_API_KEY}` }
}).then(r => r.text());
```

### Filtered Crawl with CSV Output
```javascript
const response = await fetch("https://api.riddledc.com/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${RIDDLE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 50,
    format: "csv",
    include_patterns: ["/blog/*"],
    exclude_patterns: ["/blog/drafts/*"]
  })
}).then(r => r.json());
```

## Use Cases
### Training Data
Crawl documentation sites or knowledge bases to build datasets for fine-tuning LLMs.
### Content Migration
Extract all content from an existing site as structured data for import into a new CMS.
### Competitive Analysis
Crawl competitor sites to analyze content strategy, page structure, and keyword coverage.
### Archival
Create offline backups of entire websites as structured datasets.