docs: Add guide `secure-scraping` by Mantisus · Pull Request #1908 · apify/crawlee-python

Mantisus · 2026-05-22T02:41:43Z

Description

Add a guide covering the security threats a crawler can encounter while scraping, how to handle each of them, and the Crawlee defaults that already mitigate some of them.

Issues

Closes: Secure scraping guide #1872

vdusek

a few thoughts

vdusek · 2026-05-27T09:28:49Z

+Because the target decides what your crawler receives, a malicious or compromised site can try to:
+
+- **Steer your crawler to URLs you never intended to visit** - other hosts, or internal services that are not reachable from the public internet.
+- **Reach non-HTTP destinations** through schemes like `file://`, `gopher://`, `ftp://`, or `dict://`, to read local files or talk to services such as Redis.


it seems weird to me to specify just Redis and nothing else without any further context 🙂
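
For context on the non-HTTP schemes bullet quoted above, the usual application-level mitigation is a scheme allow-list. The following is a minimal sketch using only the standard library; `ALLOWED_SCHEMES` and `is_allowed_scheme` are hypothetical names chosen for illustration, not Crawlee APIs, and this says nothing about how Crawlee itself filters schemes.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list for illustration; not a Crawlee setting.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_scheme(url: str) -> bool:
    """Return True only when the URL's scheme is on the allow-list."""
    # urlsplit normalizes the scheme to lowercase; a scheme-less URL yields "".
    return urlsplit(url).scheme in ALLOWED_SCHEMES


if __name__ == "__main__":
    for candidate in (
        "https://example.com/page",
        "file:///etc/passwd",
        "gopher://127.0.0.1:6379/_INFO",
    ):
        print(candidate, is_allowed_scheme(candidate))
```

An allow-list is preferable to a deny-list here because any scheme that is not explicitly expected is rejected by default.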

vdusek · 2026-05-27T09:29:50Z

+
+- **Steer your crawler to URLs you never intended to visit** - other hosts, or internal services that are not reachable from the public internet.
+- **Reach non-HTTP destinations** through schemes like `file://`, `gopher://`, `ftp://`, or `dict://`, to read local files or talk to services such as Redis.
+- **Exhaust your resources** with a crawler trap, an oversized response, or a decompression bomb.


we should either explain the individual "traps" or provide external links
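
As one possible way to explain the decompression-bomb item from the quoted list: a small compressed body can expand to a very large one, so a common mitigation is to cap the decompressed size. The sketch below uses only the standard library; `bounded_gunzip` and `MAX_DECOMPRESSED_BYTES` are hypothetical names, and this is an assumption-laden illustration rather than a description of what Crawlee or its HTTP clients do.

```python
import gzip
import zlib

# Hypothetical cap for illustration; pick a value that fits the workload.
MAX_DECOMPRESSED_BYTES = 10 * 1024 * 1024


def bounded_gunzip(data: bytes, limit: int = MAX_DECOMPRESSED_BYTES) -> bytes:
    """Decompress gzip data, refusing to produce more than `limit` bytes."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # Ask for at most limit + 1 bytes so that exceeding the cap is detectable
    # without ever materializing the full expanded payload.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError(f"decompressed body exceeds {limit} bytes")
    return out


if __name__ == "__main__":
    small = gzip.compress(b"hello")
    print(bounded_gunzip(small))
```

The same idea applies to oversized responses in general: read the body in chunks and stop once a byte budget is spent instead of buffering whatever the server sends.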

vdusek · 2026-05-27T09:34:11Z

+
+:::info
+
+Crawlee for Python had this SSRF gap in its sitemap and `robots.txt` handling before version 1.7.0, fixed in [#1862](https://github.com/apify/crawlee-python/pull/1862) and [#1864](https://github.com/apify/crawlee-python/pull/1864). See the advisory [GHSA-3r75-xc34-5f44](https://github.com/apify/crawlee-python/security/advisories/GHSA-3r75-xc34-5f44).


Not sure we wanna mention this - @B4nan, could you tell us your opinion please?

vdusek · 2026-05-27T09:40:15Z

+
+When following links from a page, <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> keeps only those on the same hostname by default and filters out links to any other host. Crawlee also re-checks the host a request finally lands on, so a redirect that ends on a different host is rejected. The client still walks the whole chain to get there, though, so an intermediate hop to an internal address is fetched along the way. Note that `same-hostname` does **not** include subdomains - `example.com` will not match `api.example.com`.
+
+```python


I know it is a small code example, but for maintainability, could you please implement it as a dedicated Python script, as we do for others? (applies to all examples)
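
Independent of the guide's own snippet, the two points in the quoted paragraph (exact hostname matching excludes subdomains, and an intermediate redirect hop may target an internal address) can be illustrated with a standalone sketch. It uses only the standard library; `same_hostname` and `is_internal_ip_literal` are hypothetical helpers written for this illustration and are not Crawlee's implementation.

```python
import ipaddress
from urllib.parse import urlsplit


def same_hostname(url_a: str, url_b: str) -> bool:
    """Exact hostname comparison: subdomains do not match their parent."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


def is_internal_ip_literal(url: str) -> bool:
    """Return True when the URL's host is a private, loopback, or link-local IP.

    Only IP literals are checked. A regular hostname would need DNS
    resolution first, which this sketch deliberately leaves out.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # Not an IP literal.
    return address.is_private or address.is_loopback or address.is_link_local


if __name__ == "__main__":
    print(same_hostname("https://example.com/a", "https://api.example.com/b"))
    print(is_internal_ip_literal("http://169.254.169.254/latest/meta-data/"))
```

A check like `is_internal_ip_literal` would have to run on every hop of a redirect chain, not only on the final URL, to cover the intermediate-hop case the paragraph describes.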

vdusek · 2026-05-27T09:42:04Z

+
+Isolation also changes the calculus for the application-level controls: inside a dedicated, egress-restricted environment the blast radius of an SSRF is contained, so widening the scope with `strategy='all'` or accepting arbitrary URLs is far less risky than it would be on a shared host.
+
+## Running on the Apify platform


maybe this could be a subsection of isolation? 🙂

vdusek · 2026-05-27T09:46:21Z

@@ -0,0 +1,123 @@
+---
+id: secure-scraping
+title: Secure scraping


maybe "Security of web scraping" would be better?

Mantisus added 2 commits May 22, 2026 02:40

add guide secure-scraping

d14783e

update

2c887ed

Mantisus marked this pull request as ready for review May 24, 2026 17:02

Mantisus assigned janbuchar, vdusek and Mantisus and unassigned janbuchar and vdusek May 24, 2026

Mantisus requested review from janbuchar and vdusek May 24, 2026 17:03

vdusek reviewed May 27, 2026


Mantisus wants to merge 2 commits into apify:master from Mantisus:secure-scraping



