Add draft threat model + SECURITY.md/AGENTS.md discoverability #3966
potiuk wants to merge 2 commits into
Generated-by: Claude Code
rvesse left a comment
Thanks @potiuk for the first pass at this, I think this looks like a pretty solid starting point.
I have gone through the model and made various suggested edits throughout (mostly confirming or clarifying things you'd marked as needing that). I won't commit the edits yet, as I want to give the rest of the PMC a chance to review the initial draft as-is.
Have also provided initial answers for most of the questions. Some of those answers are just me pinging other PMC members with the relevant expertise in a particular area of the codebase to provide their input
| **Wave 2 — the high-value query surfaces (the Jena CVE classes):** | ||
| 4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10. | ||
| 5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. | ||
| 6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a. |
Yes, opt-in, with an explicit white list of permitted JS functions.
For custom Java functions, the operator has to explicitly add the code to their classpath, so it is the operator's responsibility to verify that they trust the custom function code.
Code execution is by-design-operator-enabled.
| 7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8. |
I believe so, this is @afs's area of expertise having rewritten those parsers relatively recently
RDF/XML - XXE is disabled (JenaXMLInput).
JSON-LD has its own version of the XXE concern.
| **Wave 3 — resources, API, meta:** | ||
| 8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a. |
I don't think we can treat these as bugs. These are known issues in the wider RDF/SPARQL community, and it's the operator's responsibility to apply configuration (e.g. query timeout), request size limits via a reverse proxy, etc.
We ought to couple with needing to enable SERVICE in Fuseki.
(the volume concern I mentioned is number of requests - it can flood a server)
| 9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9. |
Yes, for in-process use it's the app's responsibility to verify untrusted inputs and apply any appropriate hardening.
Parameterised queries are the recommended pattern.
Fuseki itself does not have parameterised queries.
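Since Fuseki itself offers no parameterised queries, a remote client has to bind untrusted values safely before it sends the query text. The sketch below illustrates the idea in Python by rendering a value as an escaped SPARQL string literal and substituting it for a variable. The function names and the example IRI are hypothetical, and the regex substitution is a simplification that does not skip variables appearing inside string literals or comments in the template. In-process Java code would more likely rely on Jena's `ParameterizedSparqlString` than on hand-rolled escaping.

```python
import re

# Characters that must be escaped inside a double-quoted SPARQL string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def sparql_string_literal(value: str) -> str:
    """Render an untrusted value as a double-quoted SPARQL string literal."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'


def bind_literal(template: str, var: str, value: str) -> str:
    """Replace every ?var / $var in the template with the escaped literal."""
    pattern = re.compile(r"[?$]" + re.escape(var) + r"\b")
    literal = sparql_string_literal(value)
    # A function replacement stops re.sub from interpreting backslashes in the literal.
    return pattern.sub(lambda _match: literal, template)


# Hypothetical template and hostile input, for illustration only.
template = "SELECT ?s WHERE { ?s <http://example.org/name> ?name }"
query = bind_literal(template, "name", 'x" } DROP ALL #')
print(query)
```

The hostile value stays inside the quoted literal because its embedded double quote is escaped, so it cannot close the string and append further syntax.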
| 10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a. |
Probably worth reviewing the TDB FAQs page; the following two in particular come to mind as recurring topics on the mailing lists:
- Does Fuseki/TDB have a memory leak? - Unbounded memory growth under continuous read/write load is a known issue, and our use of a WAL ensures no data is lost should the process crash/restart due to this.
- Why is the database much larger on disk than my input? - This is two-fold. Firstly, we use sparse files, so depending on how disk usage is inspected (and the filesystem in use) different usage metrics can be reported. Secondly, TDB2 uses MVCC trees, so each write transaction potentially creates new blocks in the trees, orphaning the old blocks (once any active read transactions on the old tree state have completed). This is expected behaviour, and we provide a compaction operation that we recommend operators run periodically to reclaim disk space.
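The sparse-file point can be reproduced outside Jena. The following sketch (plain Python, POSIX-oriented; the file name and sizes are arbitrary, and it does not touch TDB itself) creates a file with a large hole and compares the apparent size with the space actually allocated, which is why tools that report one or the other can disagree. Whether the hole is really left unallocated depends on the filesystem.

```python
import os
import tempfile


def disk_usage(path: str) -> tuple:
    """Return (apparent_size, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on POSIX; it is absent on Windows,
    # so fall back to the apparent size there.
    blocks = getattr(st, "st_blocks", None)
    allocated = blocks * 512 if blocks is not None else st.st_size
    return st.st_size, allocated


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.dat")
    with open(path, "wb") as f:
        f.seek(64 * 1024 * 1024)  # skip 64 MiB without writing anything
        f.write(b"x")             # one real byte at the end
    apparent, allocated = disk_usage(path)

print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

On a filesystem that supports sparse files the allocated figure is typically far smaller than the apparent size, which mirrors the discrepancy described for TDB database directories.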
Thanks @rvesse, genuinely useful, detailed review. Understood you're holding your own commits so the rest of the PMC can review the as-is draft first, so I won't push anything over that. For when you're ready, I've staged a revision incorporating all your suggestions: SSRF via
I skimmed the files and have nothing to add so far. (Yesterday I realized I have been overworked for a few days and need some time off to be able to think straight again.)
Thanks @arne-bdt, appreciate the look. And genuinely, take the time off you need; there's zero rush here, the draft will keep. Be well. 🙂
| - **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document. | ||
| - **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question. | ||
| - **Draft confidence:** ~12 documented / ~2 maintainer / ~34 inferred. | ||
| - **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")* |
JSON-LD is provided by a dependency.
There is work in the current JSON-LD W3C Working Group to document and provide mitigation for the issue that JSON-LD reads remote files.
It is safer than XML External Entities, but it is nevertheless an issue.
| - **authenticated user / admin** — gated by Apache Shiro (`shiro.ini`); admin functions (`/$/*`) restricted to localhost by default *(documented)*. | ||
| - **operator/deployer** — configures Shiro, datasets, TDB location, and which endpoints are read-only vs updatable. **Trusted.** *(inferred)* | ||
| - **embedding application** (Java API) — trusted; supplies queries/RDF to the library. *(inferred)* | ||
Jena also provides a Lucene-based text index component, including in Fuseki.
Should that be included here?
| 5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. |
In default configurations, the URIs in FROM / FROM NAMED are used as names of graphs in the dataset (data already accessible via GRAPH), even for `file:` URIs.
`file:` does not read local storage.
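A toy model of the behaviour described above, assuming a dataset is simply a mapping from graph names to triples: `FROM` resolves an IRI by name lookup only and never dereferences it, so a `file:` IRI either names a graph already in the dataset or, in this model, yields nothing. The class and method names are hypothetical and this is not how Jena is implemented; it only illustrates why no local file is read in the default configuration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


class ToyDataset:
    """Hypothetical stand-in for a dataset: named graphs keyed by IRI."""

    def __init__(self, named_graphs: Dict[str, List[Triple]]):
        self._named_graphs = named_graphs

    def resolve_from(self, iri: str) -> List[Triple]:
        # Pure name lookup: no open(), no HTTP fetch, regardless of the scheme.
        return list(self._named_graphs.get(iri, []))


dataset = ToyDataset({
    "http://example.org/g1": [("ex:s", "ex:p", "ex:o")],
    "file:///data/g2": [("ex:a", "ex:b", "ex:c")],  # just a name, not a path
})

print(dataset.resolve_from("file:///data/g2"))     # graph stored under that name
print(dataset.resolve_from("file:///etc/passwd"))  # unknown name in this toy model
```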
Thanks @afs and @rvesse, a genuinely expert pass; between you you've answered most of §14 and surfaced two real gaps I'd missed. Where each lands:
New surfaces (@afs):
§14 answers (@rvesse), confirmed into the model:
On mechanics: I know you're holding your suggestion-commits so the PMC reviews the as-is draft, so I won't push over them. Whenever you land your edits I'll fold @afs's new surfaces + the §14 confirmations on top so they don't collide; or I'm happy to roll it all into one consolidated revision for you to review. Your call. Thank you both; this is the input that makes the model triage-ready.
@rvesse, just confirming the ball's in your court here, no rush from our side. I'll hold off pushing anything so the PMC can review the draft as-is. Once you land your suggestion-commits I'll fold them in, together with the jena-text (Lucene), JSON-LD remote-file SSRF, and SPARQL-family / TDB store-module items from the last round, on top of your edits rather than pushing over them. Ping the thread whenever they're in.
Co-authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Co-authored-by: Andy Seaborne <andy@apache.org>
@potiuk I think everyone who wants to has had a chance to review the first draft. I don't want us to agonise too much over getting this perfect, since it'll evolve over time as we move through this process anyway. I've folded all of mine and @afs's suggestions into this PR, so feel free to go ahead with making your revisions so we have a 2nd draft to review.
What this is
A draft threat model for Apache Jena, proposed by the ASF Security team for the Jena PMC to review, correct, or reject — drafted by the Security team's threat-model tooling from Jena's public docs and repository, following the ASF Security threat-model rubric. It was requested by the PMC (andy@) as a starting point.
This PR:
- `THREAT_MODEL.md`: the draft model;
- `SECURITY.md`: a short security policy linking the threat model;
- `AGENTS.md` with a `## Security` section, so the chain `AGENTS.md → SECURITY.md → THREAT_MODEL.md` is mechanically discoverable by automated security scanners.
How to read it
Every claim is provenance-tagged: (documented) (from Jena's docs/repo), (inferred) (reasoned from architecture, not yet confirmed), (maintainer) (confirmed by the PMC). This v0 is ~12 documented / ~34 inferred. The §14 Open questions section collects every inferred claim into waves — that is where review time is best spent. The model treats Fuseki's SPARQL endpoint as the primary boundary (public query, localhost-only admin by default are documented) and flags the high-value query surfaces for confirmation:
- `SERVICE` federation (SSRF), `file:`/FROM local-file read, ARQ JavaScript/custom functions (code exec), and RDF/XML XXE in RIOT: is each prevented/restrictable by default? (wave 2);
Nothing here is a requirement; the model is for the PMC to own. Comment inline, edit the branch, or reply on the email thread.
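One way to read "mechanically discoverable" for the `AGENTS.md → SECURITY.md → THREAT_MODEL.md` chain is that a tool can start at `AGENTS.md` and follow file-name mentions until it reaches `THREAT_MODEL.md`. The sketch below checks that chain in a checkout; the function name and the simple substring match are assumptions made for illustration, not a description of how any particular scanner works.

```python
import tempfile
from pathlib import Path

CHAIN = ["AGENTS.md", "SECURITY.md", "THREAT_MODEL.md"]


def broken_links(repo_root: Path, chain=CHAIN) -> list:
    """Return the (source, target) hops where source is missing or never mentions target."""
    problems = []
    for source, target in zip(chain, chain[1:]):
        source_path = repo_root / source
        if not source_path.is_file() or target not in source_path.read_text(encoding="utf-8"):
            problems.append((source, target))
    if not (repo_root / chain[-1]).is_file():
        problems.append((chain[-1], None))  # the final document itself is missing
    return problems


# Demo on a throwaway directory standing in for a repository checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "AGENTS.md").write_text("## Security\nSee SECURITY.md\n", encoding="utf-8")
    (root / "SECURITY.md").write_text("Threat model: THREAT_MODEL.md\n", encoding="utf-8")
    incomplete = broken_links(root)  # THREAT_MODEL.md has not been written yet
    (root / "THREAT_MODEL.md").write_text("# Threat model\n", encoding="utf-8")
    complete = broken_links(root)

print(incomplete, complete)
```

A check like this could run in CI so that renaming or removing one of the three files does not silently break the chain.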