# riddle_crawl

Crawl a site and extract content from each page into a downloadable dataset.

Plain text for agents: https://riddledc.com/docs/crawl/markdown.md

## Overview

`riddle_crawl` combines mapping and scraping into a single operation. It crawls a website, extracts structured content from every page, and returns the results as a downloadable dataset.

Set the output format with the `format` parameter: `jsonl` (the default), `json`, `csv`, or `zip`.

## Endpoint

`POST /v1/crawl`

```bash
curl -X POST "https://api.riddledc.com/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_pages": 100, "format": "jsonl"}'
```

## Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | string | Required. Starting URL to crawl. |
| `max_pages` | number | Maximum pages to crawl. Default: `100`. |
| `format` | string | Output format: `jsonl` (default), `json`, `csv`, or `zip`. |
| `include_patterns` | string[] | Only crawl URLs matching these glob patterns. |
| `exclude_patterns` | string[] | Skip URLs matching these glob patterns. |
| `js_rendering` | boolean | Render JavaScript on each page. Default: `true`. |
| `respect_robots` | boolean | Honor `robots.txt` directives. Default: `true`. |

## Response

The response contains metadata about the crawl, not the pages themselves. Download the full dataset from `artifact_url`.

```json
{
  "id": "crawl_a1b2c3d4",
  "status": "complete",
  "pages_crawled": 47,
  "format": "jsonl",
  "artifact_url": "https://api.riddledc.com/v1/artifacts/crawl_a1b2c3d4",
  "bytes": 284531
}
```
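
A minimal sketch of fetching the dataset from `artifact_url` follows. `downloadArtifact` is a hypothetical helper, and it assumes the artifact endpoint accepts the same `Authorization: Bearer` header as `POST /v1/crawl`, which this page does not confirm. The `fetchImpl` parameter exists only so that the function can be exercised without network access.

```javascript
// Hypothetical helper: downloads a crawl artifact and returns its body as text.
// Assumption: the artifact URL accepts the same Bearer token as the crawl endpoint.
async function downloadArtifact(artifactUrl, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl(artifactUrl, {
    headers: { "Authorization": `Bearer ${apiKey}` }
  });
  if (!response.ok) {
    throw new Error(`Artifact download failed with HTTP ${response.status}`);
  }
  // Suitable for the text formats (jsonl, json, csv).
  return response.text();
}

// Example call (needs network access and a real artifact URL):
// const text = await downloadArtifact(artifact_url, RIDDLE_API_KEY);
```

For the `zip` format, read the body with `response.arrayBuffer()` instead of `response.text()`, since a zip archive is binary.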

Each line in the JSONL output contains one page:

```json
{
  "url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn about our company.",
  "markdown": "# About Us\n\nWe are...",
  "word_count": 312,
  "links": [],
  "headings": []
}
```
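
Because each line is an independent JSON object, the dataset can be parsed line by line. The sketch below turns JSONL text into an array of page records. `parseJsonl` is a hypothetical helper, and the second sample record is invented for illustration.

```javascript
// Hypothetical helper: parses JSONL text (one JSON object per line) into an array.
function parseJsonl(text) {
  const records = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines, including a trailing newline
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      throw new Error(`Invalid JSON on line ${index + 1}: ${err.message}`);
    }
  });
  return records;
}

// Sample input shaped like the page object above (abbreviated fields).
const sampleJsonl = [
  '{"url": "https://example.com/about", "title": "About Us", "word_count": 312}',
  '{"url": "https://example.com/contact", "title": "Contact", "word_count": 87}'
].join("\n") + "\n";

const pages = parseJsonl(sampleJsonl);
const totalWords = pages.reduce((sum, page) => sum + page.word_count, 0);
console.log(`${pages.length} pages, ${totalWords} words`); // prints "2 pages, 399 words"
```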

## Examples

### JavaScript

```javascript
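// Assumes RIDDLE_API_KEY holds your API key and that top-level await
// is available (for example, in an ES module).
const response = await fetch("https://api.riddledc.com/v1/crawl", {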
  method: "POST",
  headers: {
    "Authorization": `Bearer ${RIDDLE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 100,
    format: "jsonl"
  })
});

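// Fail early on HTTP errors instead of parsing an error body as a result.
if (!response.ok) {
  throw new Error(`Crawl request failed with HTTP ${response.status}`);
}

// The response holds metadata only; the dataset itself lives at artifact_url.
const { artifact_url, pages_crawled } = await response.json();
console.log(`Crawled ${pages_crawled} pages; dataset at ${artifact_url}`);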
```

## Use Cases

- Training data: crawl documentation sites or knowledge bases to build datasets for fine-tuning LLMs.
- Content migration: extract all content from an existing site as structured data for import into a new CMS.
- Competitive analysis: crawl competitor sites to analyze content strategy, page structure, and keyword coverage.
- Archival: create offline backups of entire websites as structured datasets.

