feat: document-sanitization EPDF_* exports (XMP, thumbnails, JavaScript) #27

Open
Phauks wants to merge 1 commit into embedpdf:embedpdf/main from Phauks:feat/document-sanitization

Phauks commented Jun 13, 2026

Implements the engine (C++) side of embedpdf/embed-pdf-viewer#673 — document-sanitization removal functions for redaction defensibility.

Adds three EPDF_* extension functions, mirroring the existing EPDF_SetMetaText style (declared in public/fpdf_doc.h, implemented in fpdfsdk/), so the WASM build's export generator picks them up automatically:

  • EPDF_RemoveXMPMetadata(doc) — removes the catalog /Metadata (XMP) stream. XMP is stored separately from /Info, so clearing the Info dict leaves author/title/history in XMP; this removes it.
  • EPDF_RemoveEmbeddedThumbnails(doc) — removes every page's /Thumb (can retain a pre-redaction page image).
  • EPDF_RemoveAllJavaScript(doc) — removes the catalog /Names /JavaScript name tree, a JavaScript /OpenAction (a plain GoTo /OpenAction is preserved), and the catalog /AA.
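
As a rough illustration of what the three removals amount to, here is a minimal, self-contained sketch on a toy object model. Everything in it (`ToyDict`, `ToyDocument`, and the `ToyRemove*` helpers) is hypothetical and exists only for this example. It does not use PDFium's `CPDF_Dictionary` or the actual `EPDF_*` implementations, and the real code may differ in return types and edge-case handling.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a PDF dictionary (NOT PDFium's CPDF_Dictionary):
// a key maps either to a name/string value or to a nested dictionary.
struct ToyDict {
  std::map<std::string, std::string> names;
  std::map<std::string, std::shared_ptr<ToyDict>> dicts;

  // Erase the key regardless of which kind of value it holds.
  void RemoveFor(const std::string& key) {
    names.erase(key);
    dicts.erase(key);
  }

  // Return the nested dictionary for key, or nullptr if absent.
  ToyDict* GetDict(const std::string& key) {
    auto it = dicts.find(key);
    return it == dicts.end() ? nullptr : it->second.get();
  }
};

// Toy document: /Info lives apart from the catalog, which is why clearing
// /Info alone leaves the catalog's /Metadata (XMP) entry in place.
struct ToyDocument {
  ToyDict info;
  ToyDict catalog;
  std::vector<ToyDict> pages;
};

// Hypothetical analogue of EPDF_RemoveXMPMetadata.
void ToyRemoveXMPMetadata(ToyDocument& doc) {
  doc.catalog.RemoveFor("Metadata");
}

// Hypothetical analogue of EPDF_RemoveEmbeddedThumbnails.
void ToyRemoveEmbeddedThumbnails(ToyDocument& doc) {
  for (ToyDict& page : doc.pages)
    page.RemoveFor("Thumb");
}

// Hypothetical analogue of EPDF_RemoveAllJavaScript.
void ToyRemoveAllJavaScript(ToyDocument& doc) {
  if (ToyDict* names = doc.catalog.GetDict("Names"))
    names->RemoveFor("JavaScript");  // other name trees are left alone

  // Drop /OpenAction only when its action type /S is JavaScript,
  // so a plain GoTo /OpenAction is preserved.
  bool open_action_is_js = false;
  if (ToyDict* open = doc.catalog.GetDict("OpenAction")) {
    auto it = open->names.find("S");
    open_action_is_js = it != open->names.end() && it->second == "JavaScript";
  }
  if (open_action_is_js)
    doc.catalog.RemoveFor("OpenAction");

  doc.catalog.RemoveFor("AA");
}
```

The one conditional step is the /OpenAction check. The other removals are unconditional key deletions on the catalog or on each page dictionary.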

Each uses the same CPDFDocumentFromFPDFDocument + GetMutableRoot() / RemoveFor (and GetMutablePageDictionary for thumbnails) patterns as the neighbouring extensions.

The TypeScript sanitizeDocument(doc, options) engine method that composes these (plus the existing removeAttachment loop and a non-incremental save) is in a companion PR on embedpdf/embed-pdf-viewer, together with Node tests asserting that each vector is removed while unrelated content is preserved. That PR bumps the pdfium-src submodule to this commit once merged.

Open questions for maintainers are in #673 (granular exports vs. a single EPDF_SanitizeDocument; and treating hidden OCG layers as a separate follow-up).

Add three EmbedPDF extension functions for redaction-defensibility scrubbing
of non-content hidden vectors, mirroring the existing EPDF_SetMetaText style:

- EPDF_RemoveXMPMetadata: drop the catalog /Metadata XMP stream (survives an
  Info-dict clear; the #1 sanitization miss).
- EPDF_RemoveEmbeddedThumbnails: drop every page /Thumb.
- EPDF_RemoveAllJavaScript: drop /Names /JavaScript, JS /OpenAction, and /AA.

Declared in public/fpdf_doc.h (auto-exported by the WASM build's generator),
implemented in fpdfsdk/fpdf_doc.cpp via GetMutableRoot()/RemoveFor.