Add draft threat model + SECURITY.md/AGENTS.md discoverability #3966

Open
potiuk wants to merge 2 commits into apache:main from potiuk:asf-security/threat-model-2026-06-02
potiuk commented Jun 2, 2026

Member

What this is

A draft threat model for Apache Jena, proposed by the ASF Security team for the Jena PMC to review, correct, or reject — drafted by the Security team's threat-model tooling from Jena's public docs and repository, following the ASF Security threat-model rubric. It was requested by the PMC (andy@) as a starting point.

This PR:

  • adds THREAT_MODEL.md — the draft model;
  • adds SECURITY.md — a short security policy linking the threat model;
  • adds AGENTS.md with a ## Security section, so the chain AGENTS.md → SECURITY.md → THREAT_MODEL.md is mechanically discoverable by automated security scanners.

How to read it

Every claim is provenance-tagged: (documented) (from Jena's docs/repo), (inferred) (reasoned from architecture, not yet confirmed), (maintainer) (confirmed by the PMC). This v0 is ~12 documented / ~34 inferred. The §14 Open questions section collects every inferred claim into waves — that is where review time is best spent. The model treats Fuseki's SPARQL endpoint as the primary boundary (public query, localhost-only admin by default are documented) and flags the high-value query surfaces for confirmation:

  • the SPARQL Update default (read-only vs update-enabled) — decides whether anonymous write is in-model (wave 1);
  • SERVICE federation (SSRF), file:/FROM local-file read, ARQ JavaScript/custom functions (code exec), and RDF/XML XXE in RIOT — is each prevented/restrictable by default? (wave 2);
  • the resource/DoS line (query timeout / result limits) — addressing the PMC's volume concern (wave 3).

Nothing here is a requirement — the model is for the PMC to own. Comment inline, edit the branch, or reply on the email thread.

rvesse left a comment

Member

Thanks @potiuk for the first pass at this. I think this looks like a pretty solid starting point.

I have gone through the model and made various suggested edits throughout (mostly confirming or clarifying things you'd marked as needing it). I won't commit the edits yet, as I want to give the rest of the PMC a chance to review the initial draft as-is.

I have also provided initial answers for most of the questions. Some of those answers are just me pinging other PMC members with the relevant expertise in a particular area of the codebase to provide their input.

Comment thread THREAT_MODEL.md
**Wave 2 — the high-value query surfaces (the Jena CVE classes):**
4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.
6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a.

Member

Yes, opt-in, with an explicit whitelist of permitted JS functions.

For custom Java functions, the operator has to explicitly add code to their classpath, so it is the operator's responsibility to verify that they trust the custom function code.

Code execution is by-design-operator-enabled.

Comment thread THREAT_MODEL.md
4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.
6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a.
7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8.

Member

I believe so. This is @afs's area of expertise, having rewritten those parsers relatively recently.

Member

RDF/XML - XXE is disabled (JenaXMLInput).

JSON-LD has its own version of the XXE concern.

Comment thread THREAT_MODEL.md
7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8.

**Wave 3 — resources, API, meta:**
8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.

Member

I don't think we can treat these as bugs. These are known issues in the wider RDF/SPARQL community, and it is the operator's responsibility to apply configuration (e.g. query timeout), request size limits via a reverse proxy, etc.

Member

We ought to couple this with needing to enable SERVICE in Fuseki.

(The volume concern I mentioned is the number of requests: it can flood a server.)

Comment thread THREAT_MODEL.md

**Wave 3 — resources, API, meta:**
8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.
9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9.

Member

Yes, for in-process use it is the app's responsibility to verify untrusted inputs and apply any appropriate hardening.

Parameterised queries are the recommended pattern.
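
To illustrate why parameterised substitution matters, here is a minimal, language-neutral sketch in Python. It is not Jena's API (in Jena the class intended for this is `ParameterizedSparqlString`). The helper names `sparql_string_literal` and `build_query`, and the `http://example.org/name` predicate, are hypothetical. The sketch assumes only the string-escape rules of the SPARQL 1.1 grammar and is not a complete defence on its own.

```python
# Characters that must be escaped inside a double-quoted SPARQL string
# literal (the ECHAR escapes in the SPARQL 1.1 grammar).
_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\b": "\\b",
    "\f": "\\f",
}


def sparql_string_literal(value: str) -> str:
    """Render untrusted text as a double-quoted SPARQL string literal."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'


def build_query(name: str) -> str:
    """Hypothetical helper: bind an untrusted name into a fixed template."""
    return (
        "SELECT ?s WHERE { ?s <http://example.org/name> "
        + sparql_string_literal(name)
        + " }"
    )


# A value that would widen the graph pattern if concatenated unescaped.
hostile = 'x" . ?s ?p ?o } #'
print(build_query(hostile))
```

Because the quote in the hostile value is escaped, it stays inside the literal instead of closing it, so the query template keeps its shape.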

Member

Fuseki itself does not have parameterised queries.

Comment thread THREAT_MODEL.md
**Wave 3 — resources, API, meta:**
8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.
9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9.
10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a.

Member

It is probably worth reviewing the TDB FAQs page. The following two in particular come to mind as recurring topics on the mailing lists:

  1. Does Fuseki/TDB have a memory leak? - Unbounded memory growth under continuous read/write load is a known issue, and our use of a WAL ensures no data is lost should the process crash or restart because of it.
  2. Why is the database much larger on disk than my input? - This is two-fold. Firstly, we use sparse files, so depending on how disk usage is inspected (and the filesystem in use) different usage metrics can be reported. Secondly, TDB2 uses MVCC trees, so each write transaction potentially creates new blocks in the trees, orphaning the old blocks (once any active read transactions on the old tree state have completed). This is expected behaviour, and we provide a compaction operation that we recommend operators run periodically to reclaim disk space.

Member

Item 1 should be for TDB1 only.

@rvesse rvesse requested review from afs and arne-bdt June 3, 2026 13:41
potiuk commented Jun 4, 2026

Member Author

Thanks @rvesse — genuinely useful, detailed review. Understood you're holding your own commits so the rest of the PMC can review the as-is draft first, so I won't push anything over that.

For when you're ready: I've staged a revision incorporating all your suggestions — SSRF via SERVICE documented as a VALID vector (with the "no allow-list today" note), FROM/file: reworded as dataset-implementation-dependent (TDB2 restricts to dataset graphs), the "super-linear" DoS framing removed (operator-tuned, affects all compliant engines), the ARQ-JS (opt-in + eval-blacklisted) vs Java-custom-function (operator-classpath, by-design) distinction, and the TDB-FAQ resource references. It's ready to land whenever the PMC's done with the as-is draft — just say the word. (And afs@ can confirm the RDF/XML XXE-default question when convenient — that's the one item I left open.) No rush.

arne-bdt commented Jun 4, 2026

Contributor

I skimmed the files and have nothing to add so far.

(Yesterday I realized I have been overworked for a few days and need some time off to be able to think straight again.)

potiuk commented Jun 5, 2026

Member Author

Thanks @arne-bdt — appreciate the look. And genuinely, take the time off you need; there's zero rush here, the draft will keep. Be well. 🙂

Comment thread THREAT_MODEL.md
- **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document.
- **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question.
- **Draft confidence:** ~12 documented / ~2 maintainer / ~34 inferred.
- **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")*

Member

JSON-LD is provided by a dependency.

There is work in the current JSON-LD W3C Working Group to document and provide mitigation for the issue that JSON-LD reads remote files.

It is safer than XML External Entities, but it is nevertheless an issue.

Comment thread THREAT_MODEL.md
- **authenticated user / admin** — gated by Apache Shiro (`shiro.ini`); admin functions (`/$/*`) restricted to localhost by default *(documented)*.
- **operator/deployer** — configures Shiro, datasets, TDB location, and which endpoints are read-only vs updatable. **Trusted.** *(inferred)*
- **embedding application** (Java API) — trusted; supplies queries/RDF to the library. *(inferred)*

Member

Jena also provides a Lucene-based text index component, which is also included in Fuseki.
Should that be included here?

Comment thread THREAT_MODEL.md

**Wave 2 — the high-value query surfaces (the Jena CVE classes):**
4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.

Member

In default configurations, the URIs in FROM / FROM NAMED are used as names of graphs in the dataset (data that is already accessible via GRAPH), even for file: URIs.

file: does not read local storage.

potiuk commented Jun 11, 2026

Member Author

Thanks @afs and @rvesse — a genuinely expert pass; between you you've answered most of §14 and surfaced two real gaps I'd missed. Where each lands:

New surfaces (@afs):

  • jena-text (Lucene, incl. in Fuseki) — agreed, adding it to the component table as an in-scope engine component (L39).
  • JSON-LD remote-file reading (L27) — good catch; a real SSRF-adjacent surface (noting the in-flight W3C JSON-LD WG mitigation work). It joins SERVICE as an outbound-fetch lever in §6/§9 — disable/allow-list at the operator boundary.
  • SPARQL family + validation/extension modules (jena-shacl/shex/geosparql/serviceenhancer) and the store breakdown (jena-db framework, TDB1/TDB2/text) — folding your L32/L47/L49 in.

§14 answers (@rvesse), confirmed into the model:

  • SERVICE = the SSRF lever — disable if unused / egress allow-list when enabled (§9/§10).
  • ARQ JavaScript + custom Java functions = opt-in, requires explicit Fuseki and JVM/classpath config → operator-trusted code execution, not a default surface (§5a/§10).
  • Admin /$/* = localhost-only by default; exposing it needs auth config (§4).
  • DoS/resource (query timeout, request-size limits via reverse proxy) = operator config; known RDF/SPARQL-community traits, not Jena bugs (§9/§11a). In-process use = the app's responsibility to validate input + parameterised queries (§10).
  • TDB FAQs (the recurring memory-leak threads) — a perfect §11a "known non-findings" seed; I'll cite them.
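
As a rough illustration of the "egress allow-list" decision for outbound fetches such as SERVICE, here is a minimal Python sketch. It is not Jena or Fuseki code and does not claim that Fuseki offers such a hook. The host `query.example.org` and the function name `egress_allowed` are hypothetical, and in practice the check would more likely live in a proxy or firewall rule than in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: the only hosts outbound requests may reach.
ALLOWED_HOSTS = frozenset({"query.example.org"})
# Only plain web schemes; this excludes file:, ftp:, jar: and the like.
ALLOWED_SCHEMES = frozenset({"http", "https"})


def egress_allowed(url: str) -> bool:
    """Return True only if the URL targets an allow-listed host over HTTP(S)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. a broken IPv6 literal) are rejected outright.
        return False
    # urlsplit lower-cases the scheme and hostname; hostname is None if absent.
    return parts.scheme in ALLOWED_SCHEMES and parts.hostname in ALLOWED_HOSTS


for candidate in (
    "https://query.example.org/sparql",
    "http://169.254.169.254/latest/meta-data/",
    "file:///etc/passwd",
):
    print(candidate, egress_allowed(candidate))
```

The check fails closed: anything that is not explicitly listed, including internal addresses and non-HTTP schemes, is refused.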

On mechanics: I know you're holding your suggestion-commits so the PMC reviews the as-is draft — I won't push over them. Whenever you land your edits I'll fold @afs's new surfaces + the §14 confirmations on top so they don't collide; or I'm happy to roll it all into one consolidated revision for you to review. Your call. Thank you both — this is the input that makes the model triage-ready.

potiuk commented Jun 11, 2026

Member Author

@rvesse — just confirming the ball's in your court here, no rush from our side. I'll hold off pushing anything so the PMC can review the draft as-is. Once you land your suggestion-commits I'll fold them in — together with the jena-text (Lucene), JSON-LD remote-file SSRF, and SPARQL-family / TDB store-module items from the last round — on top of your edits rather than pushing over them. Ping the thread whenever they're in.

Co-authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Co-authored-by: Andy Seaborne <andy@apache.org>
rvesse commented Jun 11, 2026

Member

> @rvesse — just confirming the ball's in your court here, no rush from our side. I'll hold off pushing anything so the PMC can review the draft as-is. Once you land your suggestion-commits I'll fold them in — together with the jena-text (Lucene), JSON-LD remote-file SSRF, and SPARQL-family / TDB store-module items from the last round — on top of your edits rather than pushing over them. Ping the thread whenever they're in.

@potiuk I think everyone who wants to has had a chance to review the first draft. I don't want us to agonise too much over getting this perfect, since it'll evolve over time as we move through this process anyway. I've folded all of my and @afs's suggestions into this PR, so feel free to go ahead with your revisions so we have a 2nd draft to review.
