From 59b9c86ed913e57273f8f83f9a8c59c7869a01c0 Mon Sep 17 00:00:00 2001
From: FrancescoSaverioZuppichini
Date: Wed, 27 May 2026 22:06:45 +0200
Subject: [PATCH] Added doc for /crawl/:id/pages

---
 api-reference/endpoint/crawl/get-status.mdx | 15 ++---
 api-reference/endpoint/crawl/start.mdx      |  3 +-
 api-reference/endpoint/history.mdx          |  2 +-
 docs.json                                   |  1 +
 services/crawl.mdx                          | 75 +++++++++++++++------
 5 files changed, 64 insertions(+), 32 deletions(-)

diff --git a/api-reference/endpoint/crawl/get-status.mdx b/api-reference/endpoint/crawl/get-status.mdx
index 6b277b4..b367c88 100644
--- a/api-reference/endpoint/crawl/get-status.mdx
+++ b/api-reference/endpoint/crawl/get-status.mdx
@@ -7,7 +7,7 @@ description: 'Poll a running or finished crawl job.'
 GET https://v2-api.scrapegraphai.com/api/crawl/:id
 ```

-Returns progress and per-page results for a crawl job started with [`POST /api/crawl`](/api-reference/endpoint/crawl/start).
+Returns progress and lightweight per-page metadata for a crawl job started with [`POST /api/crawl`](/api-reference/endpoint/crawl/start). Use [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages) to fetch paginated pages with resolved scrape results.

 ## Path parameters

@@ -49,8 +49,8 @@ curl -X GET https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7
 |-------|-------------|
 | `status` | `"running"`, `"completed"`, `"failed"`, or `"stopped"`. |
 | `total` / `finished` | Progress counters. |
-| `pages[]` | Per-page results, ordered by crawl time. |
-| `pages[].scrapeRefId` | UUID of the underlying Scrape call — pass to `GET /api/history/:id` to fetch the formatted content (markdown, HTML, JSON, screenshot, etc.). |
+| `pages[]` | Lightweight per-page metadata, ordered by crawl time. |
+| `pages[].scrapeRefId` | UUID of the underlying Scrape call. |

 Poll at a reasonable cadence (every 1–5 seconds) until `status` is `"completed"`, `"failed"`, or `"stopped"`. Or use [Monitor](/services/monitor) with a webhook to avoid polling entirely.
@@ -58,18 +58,17 @@ Poll at a reasonable cadence (every 1–5 seconds) until `status` is `"completed

 ## Fetching page content

-The crawl response intentionally returns lightweight metadata (`url`, `depth`, `scrapeRefId`, etc.) rather than embedding every page's full body. Use [`GET /api/history/:id`](/api-reference/endpoint/history) with each `scrapeRefId` to fetch the formatted content the underlying scrape produced:
+The status response intentionally stays lightweight for polling. Use [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages) to fetch crawl pages with the underlying scrape result resolved into each page:

 ```bash
-# Pick a scrapeRefId from the pages[] array above
-curl -X GET https://v2-api.scrapegraphai.com/api/history/9701fc04-23de-4684-a48f-7e8fa287550b \
+curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7c6fc26f999a/pages?limit=50&cursor=0" \
   -H "SGAI-APIKEY: $SGAI_API_KEY"
 ```

-The response is a `HistoryEntry` with the full `result` payload, e.g. `result.results.markdown.data[0]` for markdown. See the [History endpoint reference](/api-reference/endpoint/history) for the entry shape and a complete crawl-to-content example.
+The response is `{ data, pagination }`, where each `data[]` item includes crawl metadata and a `scrape` payload when available.
 ## Related

 - Start a job: [`POST /api/crawl`](/api-reference/endpoint/crawl/start)
+- Fetch pages: [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages)
 - Stop / resume / delete: [Manage crawl jobs](/api-reference/endpoint/crawl/manage)
-- Fetch each page's content: [History](/api-reference/endpoint/history)
diff --git a/api-reference/endpoint/crawl/start.mdx b/api-reference/endpoint/crawl/start.mdx
index 5e06ec6..4a1586d 100644
--- a/api-reference/endpoint/crawl/start.mdx
+++ b/api-reference/endpoint/crawl/start.mdx
@@ -7,7 +7,7 @@ description: 'Kick off an async multi-page crawl and return a job id.'
 POST https://v2-api.scrapegraphai.com/api/crawl
 ```

-Starts an asynchronous crawl. The response returns a job `id` immediately; poll [`GET /api/crawl/:id`](/api-reference/endpoint/crawl/get-status) or manage the job via the [control endpoints](/api-reference/endpoint/crawl/manage).
+Starts an asynchronous crawl. The response returns a job `id` immediately; poll [`GET /api/crawl/:id`](/api-reference/endpoint/crawl/get-status), fetch page content with [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages), or manage the job via the [control endpoints](/api-reference/endpoint/crawl/manage).
 ## Request body

@@ -82,5 +82,6 @@ curl -X POST https://v2-api.scrapegraphai.com/api/crawl \
 ## Related

 - Poll progress: [`GET /api/crawl/:id`](/api-reference/endpoint/crawl/get-status)
+- Fetch pages: [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages)
 - Stop, resume, or delete: [Manage crawl jobs](/api-reference/endpoint/crawl/manage)
 - Service overview: [Crawl](/services/crawl)
diff --git a/api-reference/endpoint/history.mdx b/api-reference/endpoint/history.mdx
index b26e0e1..de7dfe6 100644
--- a/api-reference/endpoint/history.mdx
+++ b/api-reference/endpoint/history.mdx
@@ -8,7 +8,7 @@ GET https://v2-api.scrapegraphai.com/api/history
 GET https://v2-api.scrapegraphai.com/api/history/:id
 ```

-History stores every API call your account makes (scrape, extract, search, monitor ticks, crawl jobs, schema generations) and lets you fetch them back later by ID. The most common use case is **retrieving the content of a crawled page** — `GET /api/crawl/:id` returns each page's `scrapeRefId`, and you call `GET /api/history/:scrapeRefId` to get the formatted content (markdown, HTML, JSON extraction, screenshots, etc.) that the underlying scrape produced.
+History stores every API call your account makes (scrape, extract, search, monitor ticks, crawl jobs, schema generations) and lets you fetch them back later by ID. For crawl page content, use [`GET /api/crawl/:id/pages`](/api-reference/endpoint/crawl/pages) first; it returns paginated crawl pages with the underlying scrape result resolved into each page. Use History when you need to inspect an individual underlying request by its `scrapeRefId`.
 ## List history

diff --git a/docs.json b/docs.json
index 12e2615..2e312bf 100644
--- a/docs.json
+++ b/docs.json
@@ -235,6 +235,7 @@
   "pages": [
     "api-reference/endpoint/crawl/start",
     "api-reference/endpoint/crawl/get-status",
+    "api-reference/endpoint/crawl/pages",
     "api-reference/endpoint/crawl/manage"
   ]
 },
diff --git a/services/crawl.mdx b/services/crawl.mdx
index c336d9d..d592030 100644
--- a/services/crawl.mdx
+++ b/services/crawl.mdx
@@ -113,6 +113,10 @@ curl -X POST https://v2-api.scrapegraphai.com/api/crawl \
 # Check status (replace :id with the crawl id returned above)
 curl -X GET https://v2-api.scrapegraphai.com/api/crawl/:id \
   -H "SGAI-APIKEY: $SGAI_API_KEY"
+
+# Fetch pages with resolved scrape results
+curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=0" \
+  -H "SGAI-APIKEY: $SGAI_API_KEY"
 ```

@@ -173,38 +177,65 @@ Get your API key from the [dashboard](https://scrapegraphai.com/dashboard).

 ## Fetching page content

-The crawl response returns each page as lightweight metadata (`url`, `depth`, `scrapeRefId`, …) — not the full body. Use the [History](/services/history) service with each `scrapeRefId` to pull the formatted content the underlying scrape produced.
-
-
+`GET /api/crawl/:id` is designed for status polling and returns lightweight page metadata. To fetch the actual per-page content, call the paginated pages endpoint:

-```python Python
-# After the crawl completes, fetch the markdown for each page
-for page in status.data.pages:
-    if page.status != "completed":
-        continue
-    entry = sgai.history.get(page.scrape_ref_id)
-    md = entry.data.result.results.get("markdown", {}).get("data", [None])[0]
-    print(page.url, "->", md[:80] if md else "(empty)")
+```bash cURL
+curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7c6fc26f999a/pages?limit=50&cursor=0" \
+  -H "SGAI-APIKEY: $SGAI_API_KEY"
 ```

-```javascript JavaScript
-for (const page of status.data.pages) {
-  if (page.status !== "completed") continue;
-  const entry = await sgai.history.get(page.scrapeRefId);
-  const md = entry.data?.result?.results?.markdown?.data?.[0];
-  console.log(page.url, "->", md?.slice(0, 80) ?? "(empty)");
+The response is cursor-paginated:
+
+```json
+{
+  "data": [
+    {
+      "url": "https://example.com",
+      "status": "completed",
+      "depth": 0,
+      "parentUrl": null,
+      "scrapeRefId": "83a911ed-c0bc-4a8c-ad62-8efeeb93f33a",
+      "scrape": {
+        "results": {
+          "markdown": {
+            "data": ["# Example Domain\n\nThis domain is for use in illustrative examples..."]
+          }
+        },
+        "metadata": {
+          "contentType": "text/html"
+        }
+      }
+    }
+  ],
+  "pagination": {
+    "limit": 50,
+    "nextCursor": null
+  }
 }
 ```

-```bash cURL
-# Pick any scrapeRefId from the pages[] array of the crawl status response
-curl -X GET https://v2-api.scrapegraphai.com/api/history/ \
+`limit` controls how many crawl pages are returned in one response. It defaults to `50`, with a maximum of `100`.
+
+`cursor` is a zero-based index into the ordered crawl page list. Start with `cursor=0`, then use `pagination.nextCursor` as the next request's `cursor` until it returns `null`.
+
+```bash
+# First 50 pages
+curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=0" \
+  -H "SGAI-APIKEY: $SGAI_API_KEY"
+
+# Next 50 pages when the previous response returns "nextCursor": "50"
+curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=50" \
   -H "SGAI-APIKEY: $SGAI_API_KEY"
 ```

-
+See the [Get crawl pages API reference](/api-reference/endpoint/crawl/pages) for the full response shape.

-See the [History service](/services/history) for the full entry shape and the `requestParentId` linkage that ties each child scrape back to its parent crawl.
+If you only need one page's underlying Scrape request, fetch that page's `scrapeRefId` through [History](/api-reference/endpoint/history):
+
+```bash
+curl -X GET https://v2-api.scrapegraphai.com/api/history/83a911ed-c0bc-4a8c-ad62-8efeeb93f33a \
+  -H "SGAI-APIKEY: $SGAI_API_KEY"
+```

 ## Managing Crawl Jobs
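The cursor contract documented in the `services/crawl.mdx` hunk (start at `cursor=0`, pass `pagination.nextCursor` back as the next `cursor`, stop when it is `null`) can be sketched as a small client-side loop. This is a minimal sketch under stated assumptions, not SDK code: `iter_crawl_pages` and `fetch_pages_http` are hypothetical helper names, only the endpoint URL, the `SGAI-APIKEY` header, and the `{ data, pagination }` shape come from the patch, and the cursor value is passed through unchanged because the patch shows it both as a number-like index and as the string `"50"`.

```python
import json
import urllib.request
from typing import Callable, Iterator

BASE_URL = "https://v2-api.scrapegraphai.com/api"  # base URL taken from the patch


def fetch_pages_http(crawl_id: str, api_key: str, cursor, limit: int) -> dict:
    """Hypothetical helper: one GET /api/crawl/:id/pages request, parsed as JSON."""
    url = f"{BASE_URL}/crawl/{crawl_id}/pages?limit={limit}&cursor={cursor}"
    request = urllib.request.Request(url, headers={"SGAI-APIKEY": api_key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_crawl_pages(fetch: Callable[[object, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item of `data`, following `pagination.nextCursor` until it is null.

    `fetch(cursor, limit)` must return one decoded `{ data, pagination }` response.
    """
    cursor = 0  # the docs say to start at cursor=0
    while cursor is not None:
        body = fetch(cursor, limit)
        yield from body["data"]
        # Assumption: nextCursor is reused verbatim, whatever its JSON type is.
        cursor = body["pagination"]["nextCursor"]
```

With real credentials, the loop could be driven as `iter_crawl_pages(lambda c, n: fetch_pages_http(crawl_id, api_key, c, n))`. Injecting the fetch function keeps the pagination logic testable without network access.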