Merged
15 changes: 7 additions & 8 deletions api-reference/endpoint/crawl/get-status.mdx
@@ -7,7 +7,7 @@ description: 'Poll a running or finished crawl job.'
GET https://v2-api.scrapegraphai.com/api/crawl/:id
```

Returns progress and per-page results for a crawl job started with [`POST /api/crawl`](/api-reference/endpoint/crawl/start).
Returns progress and lightweight per-page metadata for a crawl job started with [`POST /api/crawl`](/api-reference/endpoint/crawl/start). Use [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages) to fetch paginated pages with resolved scrape results.

## Path parameters

@@ -49,27 +49,26 @@ curl -X GET https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7
|-------|-------------|
| `status` | `"running"`, `"completed"`, `"failed"`, or `"stopped"`. |
| `total` / `finished` | Progress counters. |
| `pages[]` | Per-page results, ordered by crawl time. |
| `pages[].scrapeRefId` | UUID of the underlying Scrape call — pass to `GET /api/history/:id` to fetch the formatted content (markdown, HTML, JSON, screenshot, etc.). |
| `pages[]` | Lightweight per-page metadata, ordered by crawl time. |
| `pages[].scrapeRefId` | UUID of the underlying Scrape call. |

<Note>
Poll at a reasonable cadence (every 1–5 seconds) until `status` is `"completed"`, `"failed"`, or `"stopped"`. Or use [Monitor](/services/monitor) with a webhook to avoid polling entirely.
</Note>
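
The polling loop described in the note could be sketched in Python roughly as follows. This is a minimal sketch, assuming the status response is a JSON object with a top-level `status` field. The names `wait_for_crawl` and `fetch_status_http`, the 2-second default interval, and the timeout are illustrative choices, not part of the documented API; only the endpoint URL, the `SGAI-APIKEY` header, and the terminal status values come from the reference above.

```python
import json
import os
import time
import urllib.request
from typing import Callable

# Terminal states listed in the response table above.
TERMINAL_STATUSES = {"completed", "failed", "stopped"}


def fetch_status_http(crawl_id: str) -> dict:
    """Hypothetical helper: GET /api/crawl/:id and decode the JSON body."""
    request = urllib.request.Request(
        f"https://v2-api.scrapegraphai.com/api/crawl/{crawl_id}",
        headers={"SGAI-APIKEY": os.environ["SGAI_API_KEY"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_crawl(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,   # assumed default, inside the 1-5 second range
    timeout_s: float = 600.0,  # assumed upper bound on total waiting time
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        # Assumption: `status` sits at the top level of the response.
        if body.get("status") in TERMINAL_STATUSES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError("crawl did not reach a terminal status in time")
        sleep(interval_s)
```

A call such as `wait_for_crawl(lambda: fetch_status_http(crawl_id))` would block until the job finishes. Passing the fetch function in as a callable keeps the loop independent of the HTTP client, so it can be exercised without network access.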

## Fetching page content

The crawl response intentionally returns lightweight metadata (`url`, `depth`, `scrapeRefId`, etc.) rather than embedding every page's full body. Use [`GET /api/history/:id`](/api-reference/endpoint/history) with each `scrapeRefId` to fetch the formatted content the underlying scrape produced:
The status response intentionally stays lightweight for polling. Use [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages) to fetch crawl pages with the underlying scrape result resolved into each page:

```bash
# Pick a scrapeRefId from the pages[] array above
curl -X GET https://v2-api.scrapegraphai.com/api/history/9701fc04-23de-4684-a48f-7e8fa287550b \
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7c6fc26f999a/pages?limit=50&cursor=0" \
-H "SGAI-APIKEY: $SGAI_API_KEY"
```

The response is a `HistoryEntry` with the full `result` payload, e.g. `result.results.markdown.data[0]` for markdown. See the [History endpoint reference](/api-reference/endpoint/history) for the entry shape and a complete crawl-to-content example.
The response is `{ data, pagination }`, where each `data[]` item includes crawl metadata and a `scrape` payload when available.

## Related

- Start a job: [`POST /api/crawl`](/api-reference/endpoint/crawl/start)
- Fetch pages: [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages)
- Stop / resume / delete: [Manage crawl jobs](/api-reference/endpoint/crawl/manage)
- Fetch each page's content: [History](/api-reference/endpoint/history)
3 changes: 2 additions & 1 deletion api-reference/endpoint/crawl/start.mdx
@@ -7,7 +7,7 @@ description: 'Kick off an async multi-page crawl and return a job id.'
POST https://v2-api.scrapegraphai.com/api/crawl
```

Starts an asynchronous crawl. The response returns a job `id` immediately; poll [`GET /api/crawl/:id`](/api-reference/endpoint/crawl/get-status) or manage the job via the [control endpoints](/api-reference/endpoint/crawl/manage).
Starts an asynchronous crawl. The response returns a job `id` immediately; poll [`GET /api/crawl/:id`](/api-reference/endpoint/crawl/get-status), fetch page content with [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages), or manage the job via the [control endpoints](/api-reference/endpoint/crawl/manage).

## Request body

@@ -82,5 +82,6 @@ curl -X POST https://v2-api.scrapegraphai.com/api/crawl \
## Related

- Poll progress: [`GET /api/crawl/:id`](/api-reference/endpoint/crawl/get-status)
- Fetch pages: [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages)
- Stop, resume, or delete: [Manage crawl jobs](/api-reference/endpoint/crawl/manage)
- Service overview: [Crawl](/services/crawl)
2 changes: 1 addition & 1 deletion api-reference/endpoint/history.mdx
@@ -8,7 +8,7 @@ GET https://v2-api.scrapegraphai.com/api/history
GET https://v2-api.scrapegraphai.com/api/history/:id
```

History stores every API call your account makes (scrape, extract, search, monitor ticks, crawl jobs, schema generations) and lets you fetch them back later by ID. The most common use case is **retrieving the content of a crawled page** — `GET /api/crawl/:id` returns each page's `scrapeRefId`, and you call `GET /api/history/:scrapeRefId` to get the formatted content (markdown, HTML, JSON extraction, screenshots, etc.) that the underlying scrape produced.
History stores every API call your account makes (scrape, extract, search, monitor ticks, crawl jobs, schema generations) and lets you fetch them back later by ID. For crawl page content, use [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages) first; it returns paginated crawl pages with the underlying scrape result resolved into each page. Use History when you need to inspect an individual underlying request by its `scrapeRefId`.

## List history

1 change: 1 addition & 0 deletions docs.json
@@ -235,6 +235,7 @@
"pages": [
"api-reference/endpoint/crawl/start",
"api-reference/endpoint/crawl/get-status",
"api-reference/endpoint/crawl/pages",
"api-reference/endpoint/crawl/manage"
]
},
75 changes: 53 additions & 22 deletions services/crawl.mdx
@@ -113,6 +113,10 @@ curl -X POST https://v2-api.scrapegraphai.com/api/crawl \
# Check status (replace :id with the crawl id returned above)
curl -X GET https://v2-api.scrapegraphai.com/api/crawl/:id \
-H "SGAI-APIKEY: $SGAI_API_KEY"

# Fetch pages with resolved scrape results
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=0" \
-H "SGAI-APIKEY: $SGAI_API_KEY"
```

</CodeGroup>
@@ -173,38 +177,65 @@ Get your API key from the [dashboard](https://scrapegraphai.com/dashboard).

## Fetching page content

The crawl response returns each page as lightweight metadata (`url`, `depth`, `scrapeRefId`, …) — not the full body. Use the [History](/services/history) service with each `scrapeRefId` to pull the formatted content the underlying scrape produced.

<CodeGroup>
`GET /api/crawl/:id` is designed for status polling and returns lightweight page metadata. To fetch the actual per-page content, call the paginated pages endpoint:

```python Python
# After the crawl completes, fetch the markdown for each page
for page in status.data.pages:
    if page.status != "completed":
        continue
    entry = sgai.history.get(page.scrape_ref_id)
    md = entry.data.result.results.get("markdown", {}).get("data", [None])[0]
    print(page.url, "->", md[:80] if md else "(empty)")
```bash cURL
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7c6fc26f999a/pages?limit=50&cursor=0" \
-H "SGAI-APIKEY: $SGAI_API_KEY"
```

```javascript JavaScript
for (const page of status.data.pages) {
  if (page.status !== "completed") continue;
  const entry = await sgai.history.get(page.scrapeRefId);
  const md = entry.data?.result?.results?.markdown?.data?.[0];
  console.log(page.url, "->", md?.slice(0, 80) ?? "(empty)");
The response is cursor-paginated:

```json
{
  "data": [
    {
      "url": "https://example.com",
      "status": "completed",
      "depth": 0,
      "parentUrl": null,
      "scrapeRefId": "83a911ed-c0bc-4a8c-ad62-8efeeb93f33a",
      "scrape": {
        "results": {
          "markdown": {
            "data": ["# Example Domain\n\nThis domain is for use in illustrative examples..."]
          }
        },
        "metadata": {
          "contentType": "text/html"
        }
      }
    }
  ],
  "pagination": {
    "limit": 50,
    "nextCursor": null
  }
}
```
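
Given the response shape shown above, pulling the markdown out of one `data[]` item might look like the following Python sketch. The helper name `first_markdown` is hypothetical. It assumes the markdown lives at `scrape.results.markdown.data[0]`, as in the sample, and that `scrape` can be missing or `null` when no result is available for a page.

```python
from typing import Optional


def first_markdown(page: dict) -> Optional[str]:
    """Return the first markdown string for a crawl page, or None.

    Assumes the layout from the sample response:
    page["scrape"]["results"]["markdown"]["data"][0].
    """
    # Each level may be absent or null, so fall back to an empty container.
    scrape = page.get("scrape") or {}
    results = scrape.get("results") or {}
    markdown = results.get("markdown") or {}
    data = markdown.get("data") or []
    return data[0] if data else None
```

Returning `None` instead of raising keeps a loop over many pages simple: pages without a resolved scrape result can be skipped with a single check.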

```bash cURL
# Pick any scrapeRefId from the pages[] array of the crawl status response
curl -X GET https://v2-api.scrapegraphai.com/api/history/<scrapeRefId> \
`limit` controls how many crawl pages are returned in one response. It defaults to `50`, with a maximum of `100`.

`cursor` is a zero-based index into the ordered crawl page list. Start with `cursor=0`, then use `pagination.nextCursor` as the next request's `cursor` until it returns `null`.

```bash
# First 50 pages
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=0" \
-H "SGAI-APIKEY: $SGAI_API_KEY"

# Next 50 pages when the previous response returns "nextCursor": "50"
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=50" \
-H "SGAI-APIKEY: $SGAI_API_KEY"
```

</CodeGroup>
See the [Get crawl pages API reference](/api-reference/endpoint/crawl/pages) for the full response shape.
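
The cursor walk described above (start at `cursor=0`, follow `pagination.nextCursor` until it is `null`) could be sketched in Python as below. The names `iter_crawl_pages` and `fetch_page_http` are hypothetical. Because the example comment shows `nextCursor` as a string, the sketch converts it with `int()`, which also works if the API returns a number; that conversion is an assumption rather than documented behavior.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_page_http(crawl_id: str, cursor: int, limit: int) -> dict:
    """Hypothetical helper: GET /api/crawl/:id/pages for one cursor position."""
    query = urllib.parse.urlencode({"limit": limit, "cursor": cursor})
    request = urllib.request.Request(
        f"https://v2-api.scrapegraphai.com/api/crawl/{crawl_id}/pages?{query}",
        headers={"SGAI-APIKEY": os.environ["SGAI_API_KEY"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_crawl_pages(
    fetch_page: Callable[[int, int], dict],
    limit: int = 50,  # documented default; the documented maximum is 100
) -> Iterator[dict]:
    """Yield every item of data[] across all pages of the response."""
    cursor: Optional[int] = 0
    while cursor is not None:
        body = fetch_page(cursor, limit)
        yield from body.get("data", [])
        next_cursor = body.get("pagination", {}).get("nextCursor")
        # A null nextCursor marks the last page and ends the loop.
        cursor = int(next_cursor) if next_cursor is not None else None
```

For a real job, something like `iter_crawl_pages(lambda c, n: fetch_page_http(crawl_id, c, n))` would stream every page lazily, issuing one request per `limit` items.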

See the [History service](/services/history) for the full entry shape and the `requestParentId` linkage that ties each child scrape back to its parent crawl.
If you only need one page's underlying Scrape request, fetch that page's `scrapeRefId` through [History](/api-reference/endpoint/history):

```bash
curl -X GET https://v2-api.scrapegraphai.com/api/history/83a911ed-c0bc-4a8c-ad62-8efeeb93f33a \
-H "SGAI-APIKEY: $SGAI_API_KEY"
```

## Managing Crawl Jobs
