Add draft threat model + SECURITY.md/AGENTS.md discoverability by potiuk · Pull Request #3966 · apache/jena

potiuk · 2026-06-02T18:17:37Z

What this is

A draft threat model for Apache Jena, proposed by the ASF Security team for the Jena PMC to review, correct, or reject — drafted by the Security team's threat-model tooling from Jena's public docs and repository, following the ASF Security threat-model rubric. It was requested by the PMC (andy@) as a starting point.

This PR:

adds THREAT_MODEL.md — the draft model;
adds SECURITY.md — a short security policy linking the threat model;
adds AGENTS.md with a ## Security section, so the chain AGENTS.md → SECURITY.md → THREAT_MODEL.md is mechanically discoverable by automated security scanners.

How to read it

Every claim is provenance-tagged: (documented) (from Jena's docs/repo), (inferred) (reasoned from architecture, not yet confirmed), (maintainer) (confirmed by the PMC). This v0 is ~12 documented / ~34 inferred. The §14 Open questions section collects every inferred claim into waves — that is where review time is best spent. The model treats Fuseki's SPARQL endpoint as the primary boundary (public query, localhost-only admin by default are documented) and flags the high-value query surfaces for confirmation:

the SPARQL Update default (read-only vs update-enabled) — decides whether anonymous write is in-model (wave 1);
SERVICE federation (SSRF), file:/FROM local-file read, ARQ JavaScript/custom functions (code exec), and RDF/XML XXE in RIOT — is each prevented/restrictable by default? (wave 2);
the resource/DoS line (query timeout / result limits) — addressing the PMC's volume concern (wave 3).

Nothing here is a requirement — the model is for the PMC to own. Comment inline, edit the branch, or reply on the email thread.

Generated-by: Claude Code

rvesse

Thanks @potiuk for the first pass at this, I think this looks like a pretty solid starting point.

I have gone through the model and made various suggested edits throughout (mostly confirming/clarifying things you'd marked as needing that). I won't commit the edits yet, as I want to give the rest of the PMC a chance to review the initial draft as-is.

I have also provided initial answers for most of the questions. Some of those answers are just me pinging other PMC members with the relevant expertise in a particular area of the codebase to provide their input.

rvesse · 2026-06-03T09:22:10Z

+**Wave 2 — the high-value query surfaces (the Jena CVE classes):**
+4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
+5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.
+6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a.


Yes, opt-in and explicit white list for permitted JS functions.

For custom Java functions, the operator has to explicitly add code to their classpath, so it is the operator's responsibility to verify that they trust the custom function code.

Code execution is by-design-operator-enabled

rvesse · 2026-06-03T09:22:44Z

+4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
+5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.
+6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a.
+7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8.


I believe so, this is @afs's area of expertise having rewritten those parsers relatively recently

rvesse · 2026-06-03T09:23:49Z

+7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8.
+
+**Wave 3 — resources, API, meta:**
+8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.


I don't think we can treat these as bugs. These are known issues in the wider RDF/SPARQL community, and it's the operator's responsibility to apply configuration (e.g. query timeout), request size limits via a reverse proxy, etc.
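
As a rough illustration of the operator-side limit mentioned here, the sketch below bounds a unit of work with a wall-clock budget using only the Java standard library. It is not Jena or Fuseki code: `TimeBudget` and `runWithin` are hypothetical names, and a real deployment would rely on the query timeout configuration that Jena/Fuseki already provide rather than on a wrapper like this.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task and gives up once the time budget is spent.
final class TimeBudget {
    private TimeBudget() {}

    static <T> Optional<T> runWithin(Duration budget, Callable<T> work) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(work);
        try {
            // Wait at most `budget` for the result.
            return Optional.ofNullable(future.get(budget.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Budget exceeded: interrupt the worker and report "no result".
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("work failed", e.getCause());
        } finally {
            // Always release the worker thread.
            pool.shutdownNow();
        }
    }
}
```

The point of the sketch is only that the cut-off is a deployment choice: the same expensive request is acceptable under one budget and rejected under another, which is why the thread treats it as operator configuration rather than a defect.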

rvesse · 2026-06-03T09:24:43Z

+
+**Wave 3 — resources, API, meta:**
+8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.
+9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9.


Yes, for in-process use it's the app's responsibility to verify untrusted inputs and apply any appropriate hardening.

Parameterised queries are the recommended pattern.
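
To make the in-process point concrete, here is a minimal, dependency-free sketch of what binding a parameter achieves: the untrusted value is escaped and substituted as a single string literal, so it cannot change the shape of the query. Jena ships `ParameterizedSparqlString` for this purpose and that is what an embedding application would normally use; the `SparqlLiteral` class below is a hypothetical stand-in written only to show the idea, it handles plain string literals only, and it assumes the template itself is trusted, application-authored text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of parameter binding; not Jena's implementation.
final class SparqlLiteral {
    private SparqlLiteral() {}

    // Render an untrusted value as a double-quoted SPARQL string literal.
    static String quote(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Replace every occurrence of ?var in a trusted template with the quoted value.
    static String bind(String template, String var, String value) {
        Pattern p = Pattern.compile("\\?" + Pattern.quote(var) + "\\b");
        return p.matcher(template).replaceAll(Matcher.quoteReplacement(quote(value)));
    }
}
```

With the template `SELECT ?s WHERE { ?s <http://example.org/name> ?name }`, binding `name` to a value that contains a double quote and SPARQL keywords yields one escaped literal in the object position, whereas naive string concatenation would let the same value close the literal and append its own syntax.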

rvesse · 2026-06-03T09:32:17Z

+**Wave 3 — resources, API, meta:**
+8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.
+9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9.
+10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a.


Probably worth reviewing the TDB FAQs page, the following two in particular come to mind that are recurring topics on the mailing lists:

Does Fuseki/TDB have a memory leak? - Unbounded memory growth under continuous read/write load is a known issue and our use of a WAL ensures no data is lost should the process crash/restart due to this

Why is the database much larger on disk than my input? - This is two-fold. Firstly, we use sparse files, so depending on how disk usage is inspected (and the filesystem in use) different usage metrics can be reported. Secondly, TDB2 uses MVCC trees, so each write transaction potentially creates new blocks in the trees, orphaning the old blocks (once any active read transactions on the old tree state have completed). This is expected behaviour, and we provide a compaction operation that we recommend operators run periodically to reclaim disk space.

potiuk · 2026-06-04T02:15:31Z

Thanks @rvesse — genuinely useful, detailed review. Understood you're holding your own commits so the rest of the PMC can review the as-is draft first, so I won't push anything over that.

For when you're ready: I've staged a revision incorporating all your suggestions — SSRF via SERVICE documented as a VALID vector (with the "no allow-list today" note), FROM/file: reworded as dataset-implementation-dependent (TDB2 restricts to dataset graphs), the "super-linear" DoS framing removed (operator-tuned, affects all compliant engines), the ARQ-JS (opt-in + eval-blacklisted) vs Java-custom-function (operator-classpath, by-design) distinction, and the TDB-FAQ resource references. It's ready to land whenever the PMC's done with the as-is draft — just say the word. (And afs@ can confirm the RDF/XML XXE-default question when convenient — that's the one item I left open.) No rush.

arne-bdt · 2026-06-04T10:12:35Z

I skimmed the files and have nothing to add so far.

(Yesterday, I realized I have been overworked for a few days and need some time off to be able to think straight again.)

potiuk · 2026-06-05T00:58:54Z

Thanks @arne-bdt — appreciate the look. And genuinely, take the time off you need; there's zero rush here, the draft will keep. Be well. 🙂

afs · 2026-06-06T13:20:09Z

+- **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document.
+- **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question.
+- **Draft confidence:** ~12 documented / ~2 maintainer / ~34 inferred.
+- **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")*


JSON-LD is provided by a dependency.

There is work in the current JSON-LD W3C Working Group to document and provide mitigation for the issue that JSON-LD reads remote files.

It is safer than XML External Entities but nevertheless, it's an issue.

afs · 2026-06-06T13:21:37Z

+  - **authenticated user / admin** — gated by Apache Shiro (`shiro.ini`); admin functions (`/$/*`) restricted to localhost by default *(documented)*.
+  - **operator/deployer** — configures Shiro, datasets, TDB location, and which endpoints are read-only vs updatable. **Trusted.** *(inferred)*
+  - **embedding application** (Java API) — trusted; supplies queries/RDF to the library. *(inferred)*
+


Jena also provides a Lucene-based text index component, which is included in Fuseki.
Should that be included here?

afs · 2026-06-06T13:56:14Z

+
+**Wave 2 — the high-value query surfaces (the Jena CVE classes):**
+4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
+5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.


In default configurations, FROM / FROM NAMED take URIs that are used as names of graphs in the dataset - data already accessible via GRAPH - even for file: URIs.

file: does not read local storage.

afs · 2026-06-06T13:57:27Z

+4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10.
+5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9.
+6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a.
+7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8.


RDF/XML - XXE is disabled (JenaXMLInput).

JSON-LD has its own version of the XXE concern.
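
For readers less familiar with what "XXE is disabled" means at the JAXP level, the following is a minimal sketch of a StAX factory that refuses external entities. The settings Jena applies in `JenaXMLInput` are not quoted in this thread, so the class name `HardenedStax` and the exact choice of properties are assumptions about a generic hardened factory, not a copy of Jena's configuration.

```java
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper showing generic JAXP hardening against XXE.
final class HardenedStax {
    private HardenedStax() {}

    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not resolve external general entities such as
        // <!ENTITY x SYSTEM "file:///some/local/file">.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // JAXP 1.5 property: an empty value means no protocol may be used
        // to fetch an external DTD.
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        return factory;
    }
}
```

DTD processing as a whole is left on in this sketch because RDF/XML documents commonly declare internal entities for namespace abbreviations; only the external fetches are refused.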

afs · 2026-06-06T13:59:10Z

+7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8.
+
+**Wave 3 — resources, API, meta:**
+8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.


We ought to couple this with needing to enable SERVICE in Fuseki.

(the volume concern I mentioned is number of requests - it can flood a server)

afs · 2026-06-06T13:59:58Z

+
+**Wave 3 — resources, API, meta:**
+8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.
+9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9.


Fuseki itself does not have parameterised queries.

afs · 2026-06-06T14:01:24Z

+**Wave 3 — resources, API, meta:**
+8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a.
+9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9.
+10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a.


Item 1 (the memory-leak FAQ) should be for TDB1 only.

potiuk · 2026-06-11T01:05:50Z

Thanks @afs and @rvesse — a genuinely expert pass; between you you've answered most of §14 and surfaced two real gaps I'd missed. Where each lands:

New surfaces (@afs):

jena-text (Lucene, incl. in Fuseki) — agreed, adding it to the component table as an in-scope engine component (L39).
JSON-LD remote-file reading (L27) — good catch; a real SSRF-adjacent surface (noting the in-flight W3C JSON-LD WG mitigation work). It joins SERVICE as an outbound-fetch lever in §6/§9 — disable/allow-list at the operator boundary.
SPARQL family + validation/extension modules (jena-shacl/shex/geosparql/serviceenhancer) and the store breakdown (jena-db framework, TDB1/TDB2/text) — folding your L32/L47/L49 in.

§14 answers (@rvesse), confirmed into the model:

SERVICE = the SSRF lever — disable if unused / egress allow-list when enabled (§9/§10).
ARQ JavaScript + custom Java functions = opt-in, requires explicit Fuseki and JVM/classpath config → operator-trusted code execution, not a default surface (§5a/§10).
Admin /$/* = localhost-only by default; exposing it needs auth config (§4).
DoS/resource (query timeout, request-size limits via reverse proxy) = operator config; known RDF/SPARQL-community traits, not Jena bugs (§9/§11a). In-process use = the app's responsibility to validate input + parameterised queries (§10).
TDB FAQs (the recurring memory-leak threads) — a perfect §11a "known non-findings" seed; I'll cite them.
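
The "egress allow-list" idea for SERVICE can be sketched as a host check applied before any outbound fetch. This is purely illustrative: the thread notes that no allow-list exists today, `ServiceEgressPolicy` is a hypothetical class, and the host names are placeholders. A hostname check like this also does not replace network-level egress controls, since a permitted name can still resolve to an internal address.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical operator-side policy: only http(s) endpoints on listed hosts pass.
final class ServiceEgressPolicy {
    private final Set<String> allowedHosts;

    ServiceEgressPolicy(Set<String> allowedHosts) {
        Set<String> normalised = new HashSet<>();
        for (String host : allowedHosts) {
            normalised.add(host.toLowerCase(Locale.ROOT));
        }
        this.allowedHosts = Set.copyOf(normalised);
    }

    boolean permits(String endpoint) {
        URI uri;
        try {
            uri = URI.create(endpoint);
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false; // e.g. file: URIs carry no host
        }
        boolean http = scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
        return http && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
    }
}
```

A policy built with the single host `query.example.org` would accept `https://query.example.org/sparql` and reject a link-local metadata address, a `file:` URI, or a URL that merely carries the permitted name in its user-info part.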

On mechanics: I know you're holding your suggestion-commits so the PMC reviews the as-is draft — I won't push over them. Whenever you land your edits I'll fold @afs's new surfaces + the §14 confirmations on top so they don't collide; or I'm happy to roll it all into one consolidated revision for you to review. Your call. Thank you both — this is the input that makes the model triage-ready.

potiuk · 2026-06-11T01:49:43Z

@rvesse — just confirming the ball's in your court here, no rush from our side. I'll hold off pushing anything so the PMC can review the draft as-is. Once you land your suggestion-commits I'll fold them in — together with the jena-text (Lucene), JSON-LD remote-file SSRF, and SPARQL-family / TDB store-module items from the last round — on top of your edits rather than pushing over them. Ping the thread whenever they're in.

Co-authored-by: Rob Vesse <rvesse@dotnetrdf.org> Co-authored-by: Andy Seaborne <andy@apache.org>

rvesse · 2026-06-11T08:36:40Z

@potiuk I think everyone who wants to has had a chance to review the first draft. I don't want us to agonise too much over getting this perfect, since it'll evolve over time as we move through this process anyway. I've folded all of my and @afs's suggestions into this PR, so feel free to go ahead with making your revisions so we have a 2nd draft to review.

Add draft threat model + SECURITY.md/AGENTS.md discoverability

8e1f271

Generated-by: Claude Code

rvesse reviewed Jun 3, 2026

rvesse requested review from afs and arne-bdt June 3, 2026 13:41

afs reviewed Jun 6, 2026

Apply suggestions from code review

530406b

Co-authored-by: Rob Vesse <rvesse@dotnetrdf.org> Co-authored-by: Andy Seaborne <andy@apache.org>
