A configurable Harper plugin that prerenders pages for bots and crawlers. It provides:
- A bot HTTP entry point (`/p/<absolute-url>` by default) that serves cached prerendered HTML or fetches from the origin, with content-encoding negotiation and conditional-request (304) handling.
- A render queue + scheduler (`render_queue`, `RenderTarget`, `RenderSchedule`) that an external render service (see `@harperfast/prerender-browser`) claims jobs from and posts results back to.
- Sitemap ingestion (`Sitemap`) that discovers URLs and schedules them for rendering.
- A prerendered-page cache (`PrerenderedPage`) and indexability signals (`NonIndexable`).

Everything that used to be hardcoded (domains, security token, device types, render/refresh
schedules, user-agent strings, TTLs) is supplied per deployment through the host application's
`config.yaml`.

```sh
npm install @harperfast/prerender
```

Add it to your Harper application's `config.yaml`:

```yaml
rest: true # required for the @export-ed table REST endpoints
'@harperfast/prerender':
  package: '@harperfast/prerender'
  files: '/'
  # --- options (all optional; defaults shown) ---
  botPathPrefix: /p/ # requests under this prefix are treated as bot requests
  domains: [] # indexable-host allowlist; empty = allow all hosts
  ingress: # how incoming bot requests are parsed (see "Ingress modes" below)
    mode: prefix # 'prefix' (native /p/<absolute-url>) or 'forwarded' (reverse proxy/CDN)
    deviceTypeSource: header # 'header' (deviceTypeHeader) or 'path' (first path segment)
    deviceTypeHeader: x-device-type
    forwardedHostHeader: x-forwarded-host # forwarded mode: original public host
    forwardedProtoHeader: x-forwarded-proto
    defaultProtocol: https
    routes: [] # forwarded mode: [{ match: exact|prefix, path, queryParams: [...] }]
  deviceTypes:
    supported: [desktop, mobile, tablet]
    default: [desktop, mobile] # device types scheduled for auto-discovered pages
  cacheKey:
    delimiter: '|'
    attributes: [url, deviceType]
  url:
    queryParams: [page] # query params kept in the cache key; ['*'] = keep all, [] = drop all
  securityToken: # shared secret sent to the origin; must match the render client
    header: x-harper-renderer-bypass
    value: '' # SET THIS per deployment (or use valueEnv to keep it out of config.yaml)
    valueEnv: '' # if set, the token is read from this env var and overrides `value`
  debugHeader: # when this request header is present, debug response headers are added
    key: x-harper-prerender-debug
    value: 'true'
  page:
    ttl: 86400000 # 24h: default cached-page TTL
    minTtl: 21600000 # 6h: floor for sitemap-derived TTLs
    swrTtl: 10800000 # 3h: stale-while-revalidate window
  render:
    defaultInterval: 86400000 # 24h: how often a target is re-rendered
    time: '07:00' # local time-of-day for the daily render run
    timezone: America/New_York
  sitemap:
    refreshTime: '12:00' # local time-of-day for the daily sitemap refresh
    timezone: America/New_York
    node: '' # pin the scheduled refresh to this node ('' disables it)
    workerIndex: 0 # ...and this worker
  queue:
    jobLeaseTime: 600000 # 10m: how long a claimed job is leased
    statusSyncInterval: 60000 # 1m: how often queue status is recomputed/broadcast
  userAgents: # per-device User-Agent strings sent to the origin
    desktop: 'Mozilla/5.0 ... HarperPrerender/1.0'
    mobile: 'Mozilla/5.0 ... HarperPrerender/1.0'
    tablet: 'Mozilla/5.0 ... HarperPrerender/1.0'
  excludePathPatterns: ['/search/'] # URLs containing these are never auto-scheduled
  analytics:
    enabled: true # record bot_request analytics at all
    recordUnmatched: true # also record UAs that matched no configured bot (as 'other')
    # Registry of crawlers tracked by name. Each entry is { name, match }, where match is a
    # case-insensitive UA substring; longer matches win. Remove an entry to stop tracking that bot.
    bots:
      - { name: Googlebot, match: googlebot }
      - { name: Bingbot, match: bingbot }
      - { name: GPTBot, match: gptbot }
      # ... (see config.js for the full default list)
```

Most options are live-reloaded when you edit `config.yaml`; no restart is needed.

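The bot registry's matching rule (case-insensitive User-Agent substring, longest match wins, unmatched agents recorded as `other`) can be sketched roughly as below. This is an illustration only: `classifyUserAgent` and its arguments are hypothetical names, not the plugin's actual code, and the registry is a three-entry subset of the defaults.

```javascript
// Hypothetical helper, not the plugin's implementation.
// Entries use the { name, match } shape from the config above.
const bots = [
  { name: 'Googlebot', match: 'googlebot' },
  { name: 'Bingbot', match: 'bingbot' },
  { name: 'GPTBot', match: 'gptbot' },
];

function classifyUserAgent(userAgent, registry, recordUnmatched = true) {
  const ua = String(userAgent || '').toLowerCase();
  let best = null;
  for (const bot of registry) {
    const needle = bot.match.toLowerCase();
    // Case-insensitive substring test; keep the entry with the longest pattern.
    if (ua.includes(needle) && (best === null || needle.length > best.match.length)) {
      best = bot;
    }
  }
  if (best) return best.name;
  // Assumption: with recordUnmatched off, an unmatched UA is simply not recorded (null here).
  return recordUnmatched ? 'other' : null;
}
```

Under these assumptions, a User-Agent containing `Googlebot/2.1` resolves to `Googlebot`, and an unrecognized agent resolves to `other` while `recordUnmatched` is enabled.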
How bot requests reach the plugin is configurable via `ingress.mode`:

- `prefix` (default): the native model. A request is a bot request when its path starts with `botPathPrefix` (`/p/`), and the remainder of the path is the absolute target URL (`GET /p/https://example.com/page`). The device type comes from the `deviceTypeHeader` (`x-device-type`).
- `forwarded`: for sitting behind a reverse proxy / CDN (e.g. Akamai) that routes a restricted set of paths to the plugin. Here the incoming request carries a relative path, the original public host in a forwarded header, and (optionally) the device type as the first path segment:
  - `ingress.routes` is an ordered list of `{ match, path, queryParams }`. `match` is `exact` or `prefix`. A request is a prerender request only if its device-stripped path matches a route, so the plugin's own resource endpoints (`/render_queue`, `/queue_status`, …) fall through to REST as long as no route matches them. `prefix` is a raw string prefix, so keep routes specific (e.g. `/catalog/`, not `/c`); an overly broad prefix like `/` would shadow those resource endpoints. The matched route's `queryParams` is the cache-key / origin-fetch query allowlist (same semantics as `url.queryParams`), so different routes can keep different params.
  - With `deviceTypeSource: path`, a leading `desktop`/`mobile`/`tablet` segment is consumed as the device type and stripped before the URL is rebuilt; if absent, the first supported device type is used and the path is left unchanged.
  - The absolute target URL is rebuilt as `${forwardedProtoHeader || defaultProtocol}://${forwardedHostHeader}${path}${query}`. A forwarded host that isn't a bare `hostname[:port]` is rejected (host-injection guard).

  Example: `GET /mobile/catalog/x.jsp?CN=...&utm=...` with `X-Forwarded-Host: www.example.com` → device `mobile`, target `https://www.example.com/catalog/x.jsp?CN=...` (a catalog route keeping only `CN`).

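The forwarded-mode steps described above (strip the device segment, match a route, check the forwarded host, apply the query allowlist, rebuild the URL) can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's code: `resolveForwardedTarget`, the `BARE_HOST` pattern, the first-match-wins route lookup, and returning `null` for rejected requests are all hypothetical choices, and only `deviceTypeSource: path` is handled.

```javascript
// Hypothetical sketch of forwarded-mode parsing; names and behaviors marked "assumed" are not from the plugin.
const ingress = {
  deviceTypeSource: 'path',
  forwardedHostHeader: 'x-forwarded-host',
  forwardedProtoHeader: 'x-forwarded-proto',
  defaultProtocol: 'https',
  routes: [{ match: 'prefix', path: '/catalog/', queryParams: ['CN'] }],
};
const supportedDeviceTypes = ['desktop', 'mobile', 'tablet'];

// Assumed pattern for "bare hostname[:port]"; the real guard may differ.
const BARE_HOST = /^[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:\d{1,5})?$/i;

function resolveForwardedTarget(requestUrl, headers) {
  const qIndex = requestUrl.indexOf('?');
  let path = qIndex === -1 ? requestUrl : requestUrl.slice(0, qIndex);
  const rawQuery = qIndex === -1 ? '' : requestUrl.slice(qIndex + 1);

  // Consume a leading device segment; otherwise use the first supported type
  // and leave the path unchanged.
  let deviceType = supportedDeviceTypes[0];
  if (ingress.deviceTypeSource === 'path') {
    const first = path.split('/')[1];
    if (supportedDeviceTypes.includes(first)) {
      deviceType = first;
      path = path.slice(first.length + 1) || '/';
    }
  }

  // Assumed: the first matching route in the ordered list wins.
  // No match means this is not a prerender request (falls through to REST).
  const route = ingress.routes.find((r) =>
    r.match === 'exact' ? path === r.path : path.startsWith(r.path)
  );
  if (!route) return null;

  // Host-injection guard: reject anything that is not hostname[:port].
  const host = headers[ingress.forwardedHostHeader];
  if (!host || !BARE_HOST.test(host)) return null;
  const proto = headers[ingress.forwardedProtoHeader] || ingress.defaultProtocol;

  // Route query allowlist: ['*'] keeps every param, [] drops them all.
  const kept = new URLSearchParams();
  for (const [key, value] of new URLSearchParams(rawQuery)) {
    if (route.queryParams.includes('*') || route.queryParams.includes(key)) {
      kept.append(key, value);
    }
  }
  const query = kept.toString();
  return { deviceType, url: `${proto}://${host}${path}${query ? `?${query}` : ''}` };
}
```

With the single catalog route above, a request for `/mobile/catalog/x.jsp?CN=123&utm=abc` carrying `x-forwarded-host: www.example.com` yields device `mobile` and target `https://www.example.com/catalog/x.jsp?CN=123`, while `/render_queue/claim` matches no route and is left for REST.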
Database/table names are fixed. Tables are split across databases by write-transaction coupling. Harper serializes writes per database and commits each database independently, so the hot, high-write queue table is isolated and bursty/heavy writes elsewhere don't serialize against it:

| Database | Tables | Notes |
|---|---|---|
| `render_schedule` | `RenderSchedule` | the hot render queue (isolated) |
| `render_service` | `RenderTarget`, `QueueStatus` | render-target registry + per-host queue status |
| `page_cache` | `PrerenderedPage` | rendered-HTML cache (heavy blob writes) |
| `sitemaps` | `Sitemap`, `SitemapRefresh` | sitemap data + refresh marker |
| `signals` | `NonIndexable` | indexability signals |
| `coordination` | `SharedBuffer` | node-local cross-worker SAB (never replicated) |

Because `RenderTarget` and `RenderSchedule` now live in separate databases, a target and its schedule
are written as two independent commits (target first). The brief window where a target exists without a
schedule is benign and self-heals on the next sitemap refresh / revalidate.

See `src/schemas/schema.graphql`.

| Method & path | Purpose |
|---|---|
| `GET /p/<absolute-url>` | Serve prerendered/cached HTML for a bot (cache hit or origin fetch) |
| `POST /render_queue/pause` | Pause the queue |
| `POST /render_queue/resume` | Resume the queue |
| `POST /render_queue/claim` | Claim due render jobs (`{ "limit": N }`) |
| `POST /render_queue/job_result` | Submit a render result (binary; `x-metadata-size` header) |
| `GET/PUT/DELETE /RenderTarget/...` | Manage render targets |
| `POST /RenderTarget {action:"revalidate"}` | Force re-render of matching targets |
| `GET/POST/DELETE /sitemaps/<url>` | Ingest / list / remove sitemaps |
| `GET /queue_status` | Read per-host queue status |

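As an illustration of the claim endpoint in the table above, a render client might build its request roughly like this. `buildClaimRequest` is a hypothetical helper; the base URL, port, and any authentication are deployment-specific assumptions, and the response shape is not documented here, so the sketch stops at constructing the request.

```javascript
// Hypothetical client helper, not part of the plugin or of @harperfast/prerender-browser.
function buildClaimRequest(baseUrl, limit) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  return {
    url: new URL('/render_queue/claim', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      // Body shape from the endpoint table: { "limit": N }
      body: JSON.stringify({ limit }),
    },
  };
}

// Usage against a running instance (host and port are assumptions):
// const { url, init } = buildClaimRequest('http://localhost:9926', 5);
// const response = await fetch(url, init);
```

Keeping request construction separate from the `fetch` call makes the payload easy to check without a live server.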
```
bot ──GET /p/<url>──▶ plugin ──cache hit?──▶ serve PrerenderedPage
                                  │ miss
                                  └─▶ fetch origin, serve, and (if indexable) schedule a RenderTarget

render client ──claim──▶ render_queue ──jobs──▶ [headless render] ──job_result──▶ PrerenderedPage
```

The render service is a separate process; see `@harperfast/prerender-browser`. Its
`RENDERER_BYPASS_*` settings must match this plugin's `securityToken`.

```sh
npm test       # unit tests (node --test)
npm run lint   # from the repo root
```