RAT-558: Add security threat model (THREAT_MODEL.md + SECURITY.md + AGENTS.md) by potiuk · Pull Request #677 · apache/creadur-rat

potiuk · 2026-06-10T17:38:14Z

What

Adds a threat model for Apache Creadur (RAT) at the Creadur PMC's request (GLASSWING / Mythos scan pre-flight):

THREAT_MODEL.md — the model (rubric).
SECURITY.md + AGENTS.md — disclosure pointer + the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md chain.

The model in brief

RAT is modelled as an in-process build/CLI license-audit tool — not a network service, and explicitly not a security/vulnerability scanner. Its security-relevant case is auditing untrusted input: the XML configuration (XXE surface) and archive descent (decompression-bomb surface). Findings that require RAT to process input the operator already trusts (the normal case — your own source tree) are out of model.

DRAFT — you own it; two quick technical confirmations

Because RAT is small, the §8-vs-§9 split hinges on two facts I've left as section 14 questions:

Q3 — does XMLConfigurationReader disable DOCTYPE/external entities (XXE-safe)?
Q4 — does ArchiveWalker bound decompression (size/depth/entry-count)?

Your answers turn those from "open question" into either a provided property (§8) or a documented gap + downstream note (§9). Also Q6: want me to add the same chain to creadur-whisker and creadur-tentacles so all three are discoverable?

Generated by the ASF Security team's threat-model tooling (Claude Opus); reviewed before opening.

Rebased onto current master, which already added AGENTS.md and SECURITY.md. Keeps both maintainer files and adds the detailed THREAT_MODEL.md plus the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md pointers. Generated-by: Claude Opus 4.8 (1M context)

Claudenw · 2026-06-15T14:40:44Z

This PR looks like it needs answers from developers before submitting.
@potiuk is that the expected flow? We answer questions and push the resulting file(s) to the repo?

potiuk · 2026-06-15T22:02:58Z

Yes, absolutely. It's enough to comment in the PR answering the questions, and I will update the PR accordingly.

ottlinger · 2026-06-16T12:34:31Z

+or a **Maven plugin** — always **in the developer's or CI's own process**,
+never as a network service. Whisker generates license documentation; Tentacles
+inspects staged release bundles. None is a server.
+


Would it make sense to add a note here that RAT can also be used to change your own sources to include license headers? In this way user input can be altered. Or is this not relevant from a security perspective?

ottlinger · 2026-06-16T12:37:10Z

+  when RAT is deliberately run on your own (trusted) code — the dominant,
+  intended case. Findings whose only impact requires running RAT on input you
+  already trust are `OUT-OF-MODEL: trusted-input`.
+- **Test resources** (the deliberately-odd license fixtures under


Would it make sense to document that the Maven/Ant/CLI options are generated, as this would mean that security implications are automatically transferred to all of RAT's UIs?

ottlinger · 2026-06-16T12:40:12Z

+
+## §15 Appendix — existing-policy back-map
+
+No in-repo `SECURITY.md` exists today; this PR adds one (ASF security-process


A very basic security file was introduced via #671.

Claudenw · 2026-06-16T16:03:46Z

@potiuk

In answer to the first question, there is a PR to ensure we have this covered.
In answer to the second question, we use the Apache Commons Compress ArchiveStreamFactory class to create archive streams. If the filters indicate that we should process the archived file, we extract the contents into an in-memory buffer, where it is held until the document is processed. So we could hit OOM.

Incorporates the Creadur PMC's PR apache#677 review:
- archive walker confirmed unbounded (in-memory extraction) -> §9 gap + §10
- XML/DOCTYPE hardening noted as in-flight PMC PR (§14 Q3, link pending)
- documents RAT write mode (--addLicense) as trusted-input / out-of-model
- notes CLI/Ant/Maven front-ends are generated from a common core
- §15 corrected: SECURITY.md already exists (added via apache#671)

Generated-by: Claude Opus 4.8

potiuk · 2026-06-17T01:28:18Z

Thanks Claudenw and ottlinger — folded your review into the draft and pushed (THREAT_MODEL.md, +53/-21):

Archive handling (your Q on OOM): confirmed as a gap. The unbounded in-memory extraction via Commons Compress ArchiveStreamFactory is now §9 (not a provided property), with the matching §10 downstream responsibility (sandbox / resource-limit RAT runs over untrusted archives) and a VALID-HARDENING disposition. Marked maintainer-confirmed.
XML/DOCTYPE (your "there is a PR to ensure we have this covered"): I read that as the XXE-hardening question (§14 Q3). I've noted a hardening PR is in flight but left §8 #2 tentative pending the link — could you drop the PR number here so I can cite it? Once it lands I'll flip XXE to a provided property.
ottlinger — write mode: good catch. RAT's --addLicense / editor mode writes headers into the audited tree; I've documented it explicitly in §2/§3 as operator-invoked on the operator's own (trusted) sources → it's OUT-OF-MODEL: trusted-input, but now stated rather than silent.
ottlinger — generated front-ends: added a §2 note that CLI/Ant/Maven are generated from a common option core, so a security property (or gap) in the core transfers to all three UIs.
ottlinger — SECURITY.md / #671: thanks, fixed the stale §15 line — it now reflects that #671 added SECURITY.md and this PR just appends the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md pointer.

Still open if you have a moment (one line each is plenty): Q1 (confirm the untrusted-input case is the one to model), Q2 (RAT makes no network connections), Q5 (Whisker/Tentacles share the profile), and Q6 (want us to add the same pointer files to creadur-whisker/-tentacles, or will you?).

One note on CI: the failing "Build and analyze" (CodeQL) check is unrelated to this PR — it's a docs-only change (three .md files), so it isn't introducing or affected by that build job; looks pre-existing/flaky on the branch.

Claudenw · 2026-06-17T07:02:37Z

@potiuk There is one more point that has not been discussed. RAT allows developers to extend the matching algorithms. See https://creadur.apache.org/rat/license_def.html#Matchers

The upshot is that 3rd parties can create new matchers and use them in license checks. Matchers are different from license checks in that license checks use matchers. For example the Apache 2.0 license check uses the text, spdx and any matchers.

Matchers scan the contents of the file (as a String) looking for matches. This means that a custom matcher would have access to all text from all files that are selected for scanning.

But this is defined in the configuration and is under control of the developer using RAT.

ottlinger · 2026-06-17T12:37:47Z

@potiuk the Sonar build only runs on specific branches or specific PRs, as the credentials are not shared among all PRs/builds due to ASF restrictions.

Claudenw · 2026-06-17T16:16:44Z

> could you drop the PR number here so I can cite it? Once it lands I'll flip XXE to a provided property.

#679 is the PR that does the XXE hardening.

Per Claudenw (PR apache#677): RAT lets operators define custom matcher classes that see all scanned file text, but the matcher set is operator-defined config (not attacker-supplied), so it's OUT-OF-MODEL: trusted-input — same posture as the write mode. Generated-by: Claude Opus 4.8 (1M context)

potiuk · 2026-06-18T22:40:20Z

Thanks Claudenw — folded the custom-matcher surface into §3: RAT lets operators define custom matcher classes that see the full text of every scanned file, but since the matcher set is operator-defined config (not attacker-supplied), a custom matcher reading scanned text is OUT-OF-MODEL: trusted-input — the same posture as the write mode. Marked maintainer-confirmed.

ottlinger — thanks for the CI clarification; that matches what we expected (the Sonar/CodeQL job is credential-gated to specific branches under ASF restrictions, so its red status on this fork PR is environmental, not anything this docs-only change introduces).

Still one open item: the XXE-hardening PR number (§14 Q3) — I've left §8 #2 tentative pending it. Whenever you drop the number I'll cite it and flip XXE from "hardening in flight" to a provided property. No rush.

The remaining §14 questions (Q1 untrusted-input posture, Q2 no-network, Q5 Whisker/Tentacles profile, Q6 sibling pointer files) are still open whenever convenient — one line each is plenty.

ottlinger · 2026-06-19T09:27:24Z

@potiuk - thanks again:

Q5: As development activity on Tentacles/Whisker is rather low at the moment, I'd personally prefer to start with RAT and add Tentacles/Whisker later in order to reduce noise.
Q2: AFAIK RAT does not open network connections, as it runs locally. The only exception is downloading dependencies via Maven/Ant/build tools in order to build the artifacts locally.

Claudenw · 2026-06-19T11:01:22Z

+sometimes pointed at **untrusted input**: a CI job auditing an untrusted
+contribution/PR, or auditing a downloaded third-party artifact. That is the
+case the model cares about. *(inferred — Q1.)*
+


This statement is correct

Claudenw · 2026-06-19T11:04:32Z

+caller invokes RAT (CLI/Ant/Maven) on a directory + a config
+   │ trusted invocation
+   ▼
+read configuration (XMLConfigurationReader) ── XXE surface if config is untrusted


#679 is the PR that does the XXE hardening. I don't know if that impacts here.

Claudenw · 2026-06-19T11:08:02Z

+
+- A JRE; RAT reads the filesystem it is pointed at and writes a report. It opens
+  **no network connections** and runs no services. *(inferred — Q2, the
+  no-network claim is high-value to confirm.)*


True, no network connections are opened by RAT. RAT only opens files. One potential hole in this is XSLT transforms where the operator could add an xsl:include statement to open a connection to a remote system. This is out of scope as the XSLT are in the trusted space under control of the operator.
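For operators who would rather close the xsl:include path than rely on stylesheet trust, the following is a minimal sketch using the standard JAXP attributes. It is not RAT's code; the class and method names are hypothetical, and it assumes the JDK's built-in XSLT implementation (a third-party TransformerFactory may reject these attributes).

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of RAT: a TransformerFactory that refuses
// to follow xsl:include / xsl:import references over any protocol.
final class LocalOnlyXsltSketch {
    private LocalOnlyXsltSketch() {}

    static TransformerFactory newLocalOnlyFactory() throws TransformerConfigurationException {
        TransformerFactory tf = TransformerFactory.newInstance();
        tf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // An empty protocol list means "no external access at all".
        tf.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        tf.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        return tf;
    }

    /** Returns true if the stylesheet text compiles under the restricted factory. */
    static boolean compiles(String stylesheet) {
        try {
            newLocalOnlyFactory().newTransformer(new StreamSource(new StringReader(stylesheet)));
            return true;
        } catch (TransformerConfigurationException e) {
            return false;
        }
    }
}
```

Under these settings a stylesheet that includes a remote resource should fail to compile before any connection is attempted, while a self-contained stylesheet is unaffected.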

Claudenw · 2026-06-19T11:10:47Z

+  **no network connections** and runs no services. *(inferred — Q2, the
+  no-network claim is high-value to confirm.)*
+- The XML parser behaviour depends on the platform JAXP unless RAT configures it
+  (§5a/§8). *(inferred — Q3.)*


It depends upon JAXP and can be configured through the JAXP environment variables as documented:
https://docs.oracle.com/javase/8/docs/technotes/guides/security/jaxp/jaxp.html#setting-jaxp-properties-as-system-properties

Claudenw · 2026-06-19T11:11:37Z

+(§8/§9, maintainer-confirmed). XML-parser DOCTYPE handling is being hardened via
+a PMC PR (§14 Q3). There is no "insecure default toggle". *(maintainer / Q3
+pending PR link.)*
+


See above for PR link

Claudenw · 2026-06-19T11:16:29Z

+| Input | Attacker-controllable? (untrusted-run) | Concern |
+| --- | --- | --- |
+| scanned file content | **yes** | parsed/read; resource use |
+| scanned file paths / archive entry names | **yes** | path handling on archive extraction |


What does "path handling on archive extraction" mean? We do not extract the data into a directory. We read the files from within the archive and process them from there. The file paths are reported relative to the archive, so something like "/bar/baz.zip#/junk.txt" is reported for a file junk.txt in the archive baz.zip found in the /some/dir/bar/ directory on a Unix/Mac system where RAT was pointed to /some/dir as the tree to scan.

But the contents of junk.txt were only ever extracted to memory.
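To illustrate why the entry name is only a label here, the following is a small sketch of how such a report string could be composed. It is an assumption for illustration, not RAT's implementation; the class and method names are hypothetical.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration, not RAT's code: builds the report label for an
// archive entry. The entry name is only concatenated into a string; it is
// never resolved against the filesystem, so "../" segments have no effect.
final class ArchiveEntryLabelSketch {
    private ArchiveEntryLabelSketch() {}

    static String label(Path scanRoot, Path archive, String entryName) {
        String relative = scanRoot.relativize(archive).toString()
                .replace(File.separatorChar, '/');
        String entry = entryName.startsWith("/") ? entryName : "/" + entryName;
        return "/" + relative + "#" + entry;
    }

    public static void main(String[] args) {
        Path root = Paths.get("/some/dir");
        Path archive = Paths.get("/some/dir/bar/baz.zip");
        // prints /bar/baz.zip#/junk.txt
        System.out.println(label(root, archive, "junk.txt"));
        // prints /bar/baz.zip#/../../etc/passwd (a string only, nothing is written)
        System.out.println(label(root, archive, "../../etc/passwd"));
    }
}
```

A zip-slip style entry name therefore ends up in the report text rather than in a file path that is opened for writing.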

Claudenw · 2026-06-19T11:31:16Z

+  trusted caller; inputs are normally trusted, but the security-relevant case is
+  RAT auditing **untrusted** input (CI on untrusted PRs, third-party artifacts).
+  Is that the case you want modelled, or do you consider all RAT input trusted
+  (which would move XXE/archive items to `OUT-OF-MODEL: trusted-input`)? (§2/§7.)


All rat configuration items (XSLT stylesheets, configuration files, license definitions, matcher implementations) are trusted and under control of the operator.

The files that are read may be untrusted, as you point out, in the case of verification of PRs from 3rd parties.

The attack surface is anything that can break out of the scanning stream when the system is run with the default settings. I am certain that there are settings that could open the system up for attack, for example the JAXP environment vars.

Claudenw · 2026-06-19T11:31:43Z

+- **Q3.** *(Partially answered — PMC, PR #677: a hardening PR is in flight
+  ensuring DOCTYPE / external-entity handling is covered. **Pending the PR link
+  to cite**; once landed §8 #2 becomes a provided property.)* Does
+  `XMLConfigurationReader` disable DOCTYPE / external entities (XXE-safe)?


External entities are disabled.
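For readers unfamiliar with what disabling external entities looks like in JAXP, here is a minimal sketch of a hardened DocumentBuilderFactory. It is not the code from XMLConfigurationReader or from #679; the class and method names are hypothetical, and the feature URIs are the ones recognised by the JDK's built-in Xerces-based parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper, not RAT's code: a DOM parser that rejects DOCTYPE
// declarations and never resolves external entities.
final class HardenedXmlSketch {
    private HardenedXmlSketch() {}

    static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Strictest option: any DOCTYPE declaration is a fatal error.
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defence in depth in case DOCTYPE is ever re-enabled.
        f.setFeature("http://xml.org/sax/features/external-general-entities", false);
        f.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        f.setXIncludeAware(false);
        f.setExpandEntityReferences(false);
        return f.newDocumentBuilder();
    }

    /** Returns true if the text parses under the hardened settings. */
    static boolean parses(String xml) {
        try {
            newHardenedBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

Setting the features in code, rather than relying on JAXP system properties, keeps the behaviour independent of how the JVM running the tool was launched.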

Claudenw · 2026-06-19T11:32:59Z

+  in-memory buffer (Commons Compress `ArchiveStreamFactory`) held until the
+  document is processed, so a crafted archive can OOM. Resolved as a §9 gap +
+  §10 responsibility; §8 #1 is **not** a provided property.)* Does
+  `ArchiveWalker` bound decompression (size/depth/entry-count)?


We do not guard against OOM. We probably should add a limit, but at this time we do not. This is out of scope and not provided.
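As an illustration of what such a limit could look like, the sketch below reads a stream into memory but gives up once a byte budget is exceeded. It is an assumption, not RAT's code: the class name, method name and the idea of a per-entry byte limit are all hypothetical, and it uses only the JDK rather than Commons Compress.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of RAT: buffers a stream in memory while
// enforcing an upper bound on the number of bytes held.
final class BoundedEntryReader {
    private BoundedEntryReader() {}

    /** Reads the whole stream, or throws IOException once more than maxBytes were read. */
    static byte[] readBounded(InputStream in, long maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxBytes) {
                // Stop before buffering the oversized chunk.
                throw new IOException("entry exceeds limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] small = readBounded(new ByteArrayInputStream(new byte[100]), 1024);
        System.out.println("read " + small.length + " bytes");
        try {
            readBounded(new ByteArrayInputStream(new byte[4096]), 1024);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Counting decompressed bytes as they are read, instead of trusting the size declared in the archive header, is what makes the bound effective against a decompression bomb. Nesting depth and entry count would need separate limits.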

Claudenw · 2026-06-19T11:33:09Z

+**Wave 2 — surface.**
+
+- **Q2.** Confirm RAT makes no network connections and has no side effects beyond
+  reading the scanned tree and writing the report. (§5/§8.)


Confirmed above.

…d (external entities disabled, apache#679 hardens DOCTYPE), no-network confirmed (XSLT xsl:include caveat), correct archive path-handling (read to memory, no extract-to-disk → no path traversal), Whisker/Tentacles deferred

potiuk · 2026-06-21T19:20:48Z

Thanks @ottlinger and @Claudenw — that's everything we needed. I've folded all your answers into THREAT_MODEL.md (pushed). Summary of what changed:

Trust posture (Q1) — confirmed and now marked (maintainer): RAT config (XSLT, config files, license defs, custom matchers) is operator-trusted; the scanned files are the untrusted surface. §2/§7.
No network (Q2) — confirmed (maintainer). Added your XSLT xsl:include nuance: the one operator-reachable way out is a remote xsl:include, and since stylesheets are trusted config that's OUT-OF-MODEL. §5/§8 #3.
XXE (Q3) — external entities are disabled, so §8 #2 is now a provided property (was tentative); noted PR #679 (RAT-560: changes to reduce XXE exposure) as the DOCTYPE-hardening follow-up, and the JAXP-system-properties configurability. §5/§5a/§8.
Archive bound (Q4) — kept as a disclaimed §9 gap (no bound, OOM not guarded).
Path handling — corrected a phantom risk: since RAT reads entries into memory and never extracts to disk, there's no zip-slip / path-traversal-on-write surface. An entry label like bar/baz.zip#/junk.txt is just a report string. §6/§9.
Whisker/Tentacles (Q5/Q6) — scoped this PR to creadur-rat per your preference; the sibling pointer files are a deferred follow-up.

With every §14 question answered, the model is ready to ratify whenever the PMC's happy with it. (The red check is the CodeQL "Build and analyze" job, which is unrelated to these doc-only changes — all 13 build/test matrix jobs pass.)

ottlinger changed the title ~~Add security threat model (THREAT_MODEL.md + SECURITY.md + AGENTS.md)~~ RAT-558: Add security threat model (THREAT_MODEL.md + SECURITY.md + AGENTS.md) Jun 11, 2026

potiuk force-pushed the asf-security/threat-model-2026-06-10 branch from 35879b0 to d4f0fdd Compare June 14, 2026 01:19

ottlinger reviewed Jun 16, 2026


ottlinger requested a review from Claudenw June 16, 2026 12:40

Claudenw reviewed Jun 19, 2026
