
RAT-558: Add security threat model (THREAT_MODEL.md + SECURITY.md + AGENTS.md)#677

Open
potiuk wants to merge 4 commits into apache:master from potiuk:asf-security/threat-model-2026-06-10

Conversation

@potiuk

@potiuk potiuk commented Jun 10, 2026

Member

What

Adds a threat model for Apache Creadur (RAT) at the Creadur PMC's request (GLASSWING / Mythos scan pre-flight):

  • THREAT_MODEL.md — the model (rubric).
  • SECURITY.md + AGENTS.md — disclosure pointer + the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md chain.

The model in brief

RAT is modelled as an in-process build/CLI license-audit tool — not a network service, and explicitly not a security/vulnerability scanner. Its security-relevant case is auditing untrusted input: the XML configuration (XXE surface) and archive descent (decompression-bomb surface). Findings that require RAT to process input the operator already trusts (the normal case — your own source tree) are out of model.

DRAFT — you own it; two quick technical confirmations

Because RAT is small, the §8-vs-§9 split hinges on two facts I've left as section 14 questions:

  • Q3 — does XMLConfigurationReader disable DOCTYPE/external entities (XXE-safe)?
  • Q4 — does ArchiveWalker bound decompression (size/depth/entry-count)?

Your answers turn those from "open question" into either a provided property (§8) or a documented gap + downstream note (§9). Also Q6: want me to add the same chain to creadur-whisker and creadur-tentacles so all three are discoverable?

Generated by the ASF Security team's threat-model tooling (Claude Opus); reviewed before opening.

@ottlinger ottlinger changed the title Add security threat model (THREAT_MODEL.md + SECURITY.md + AGENTS.md) RAT-558: Add security threat model (THREAT_MODEL.md + SECURITY.md + AGENTS.md) Jun 11, 2026
Rebased onto current master, which already added AGENTS.md and SECURITY.md. Keeps both maintainer files and adds the detailed THREAT_MODEL.md plus the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md pointers.

Generated-by: Claude Opus 4.8 (1M context)
@potiuk potiuk force-pushed the asf-security/threat-model-2026-06-10 branch from 35879b0 to d4f0fdd Compare June 14, 2026 01:19
@Claudenw

Contributor

This PR looks like it needs answers from developers before submitting.
@potiuk is that the expected flow? We answer questions and push the resulting file(s) to the repo?

@potiuk

potiuk commented Jun 15, 2026

Member Author

Yes, absolutely. It's enough to comment in the PR answering the questions, and I will update the PR accordingly.

Comment thread THREAT_MODEL.md
or a **Maven plugin** — always **in the developer's or CI's own process**,
never as a network service. Whisker generates license documentation; Tentacles
inspects staged release bundles. None is a server.

Contributor

Would it make sense to add a note here that RAT can also be used to modify your own sources by adding license headers? That way user input can be altered. Or is this not relevant from a security perspective?

Comment thread THREAT_MODEL.md
when RAT is deliberately run on your own (trusted) code — the dominant,
intended case. Findings whose only impact requires running RAT on input you
already trust are `OUT-OF-MODEL: trusted-input`.
- **Test resources** (the deliberately-odd license fixtures under

Contributor

Would it make sense to document that the Maven/Ant/CLI options are generated, since this means that security implications automatically carry over to all of RAT's UIs?

Comment thread THREAT_MODEL.md Outdated

## §15 Appendix — existing-policy back-map

No in-repo `SECURITY.md` exists today; this PR adds one (ASF security-process

Contributor

A very basic security file was introduced via #671.

@ottlinger ottlinger requested a review from Claudenw June 16, 2026 12:40
@Claudenw

Contributor

@potiuk

In answer to the first question, there is a PR to ensure we have this covered.
In answer to the second question, we use the Apache Commons Compress ArchiveStreamFactory class to create archive streams. If the filters indicate that we should process the archived file, we extract the contents into an in-memory buffer, where they are held until the document is processed. So we could hit OOM.
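
For illustration only, the kind of limit being asked about in Q4 could look like the sketch below. It is not RAT's code and does not use Commons Compress: it swaps in the JDK's `java.util.zip.ZipInputStream` so the example stays self-contained, and the class name `BoundedArchiveReader` and its three budget parameters are hypothetical. It counts decompressed bytes as they arrive rather than trusting the size recorded in the entry header. Nesting depth is not handled here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper; not part of RAT or Commons Compress. */
final class BoundedArchiveReader {

    /**
     * Reads every file entry of a ZIP stream into memory and fails fast when a
     * caller-supplied budget is exceeded. Returns the total decompressed bytes.
     */
    static long readAll(InputStream archive, int maxEntries,
                        long maxEntryBytes, long maxTotalBytes) throws IOException {
        long totalBytes = 0;
        int entryCount = 0;
        byte[] chunk = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (++entryCount > maxEntries) {
                    throw new IOException("archive exceeds entry limit of " + maxEntries);
                }
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                long entryBytes = 0;
                int n;
                // Count bytes while inflating; the header's declared size is
                // attacker-controlled and is deliberately ignored.
                while ((n = zip.read(chunk)) != -1) {
                    entryBytes += n;
                    totalBytes += n;
                    if (entryBytes > maxEntryBytes || totalBytes > maxTotalBytes) {
                        throw new IOException("archive exceeds size limit at entry " + entry.getName());
                    }
                    content.write(chunk, 0, n);
                }
                // content.toByteArray() is where a scanner would take over.
            }
        }
        return totalBytes;
    }
}
```

A caller would pick the budgets to fit the heap it is willing to spend, for example `BoundedArchiveReader.readAll(in, 10_000, 64L << 20, 512L << 20)`; those numbers are placeholders, not recommendations from the PMC.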

Incorporates the Creadur PMC's PR apache#677 review:
- archive walker confirmed unbounded (in-memory extraction) -> §9 gap + §10
- XML/DOCTYPE hardening noted as in-flight PMC PR (§14 Q3, link pending)
- documents RAT write mode (--addLicense) as trusted-input / out-of-model
- notes CLI/Ant/Maven front-ends are generated from a common core
- §15 corrected: SECURITY.md already exists (added via apache#671)

Generated-by: Claude Opus 4.8
@potiuk

potiuk commented Jun 17, 2026

Member Author

Thanks Claudenw and ottlinger — folded your review into the draft and pushed (THREAT_MODEL.md, +53/-21):

  • Archive handling (your Q on OOM): confirmed as a gap. The unbounded in-memory extraction via Commons Compress ArchiveStreamFactory is now §9 (not a provided property), with the matching §10 downstream responsibility (sandbox / resource-limit RAT runs over untrusted archives) and a VALID-HARDENING disposition. Marked maintainer-confirmed.

  • XML/DOCTYPE (your "there is a PR to ensure we have this covered"): I read that as the XXE-hardening question (§14 Q3). I've noted a hardening PR is in flight but left §8 #2 tentative pending the link. Could you drop the PR number here so I can cite it? Once it lands I'll flip XXE to a provided property.

  • ottlinger — write mode: good catch. RAT's --addLicense / editor mode writes headers into the audited tree; I've documented it explicitly in §2/§3 as operator-invoked on the operator's own (trusted) sources → it's OUT-OF-MODEL: trusted-input, but now stated rather than silent.

  • ottlinger — generated front-ends: added a §2 note that CLI/Ant/Maven are generated from a common option core, so a security property (or gap) in the core transfers to all three UIs.

  • ottlinger, on SECURITY.md / #671: thanks, fixed the stale §15 line. It now reflects that #671 added SECURITY.md and this PR just appends the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md pointer.

Still open if you have a moment (one line each is plenty): Q1 (confirm the untrusted-input case is the one to model), Q2 (RAT makes no network connections), Q5 (Whisker/Tentacles share the profile), and Q6 (want us to add the same pointer files to creadur-whisker/-tentacles, or will you?).

One note on CI: the failing "Build and analyze" (CodeQL) check is unrelated to this PR — it's a docs-only change (three .md files), so it isn't introducing or affected by that build job; looks pre-existing/flaky on the branch.

@Claudenw

Claudenw commented Jun 17, 2026

Contributor

@potiuk There is one more point that has not been discussed. RAT allows developers to extend the matching algorithms. See https://creadur.apache.org/rat/license_def.html#Matchers

The upshot is that 3rd parties can create new matchers and use them in license checks. Matchers are different from license checks in that license checks use matchers. For example the Apache 2.0 license check uses the text, spdx and any matchers.

Matchers scan the contents of the file (as a String) looking for matches. This means that a custom matcher would have access to all text from all files that are selected for scanning.

But this is defined in the configuration and is under control of the developer using RAT.
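
To make the exposure concrete, here is a minimal sketch of why a custom matcher sees everything that is scanned. The `TextMatcher` interface and the `MatcherDemo` class are hypothetical stand-ins invented for this example; they are not RAT's actual matcher API, which is described at the link above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; none of these types exist in RAT. */
final class MatcherDemo {

    /** Stand-in for a matcher contract: it is handed the file text as a String. */
    interface TextMatcher {
        boolean matches(String fileText);
    }

    /** Does a plausible license check, but also retains every text it is shown. */
    static final class RecordingMatcher implements TextMatcher {
        private final List<String> seen = new ArrayList<>();

        @Override
        public boolean matches(String fileText) {
            seen.add(fileText); // the complete text of the scanned file
            return fileText.contains("Apache License");
        }

        List<String> seen() {
            return seen;
        }
    }

    /** Runs one matcher over every selected file's text and counts the hits. */
    static int countMatches(TextMatcher matcher, List<String> fileTexts) {
        int hits = 0;
        for (String text : fileTexts) {
            if (matcher.matches(text)) {
                hits++;
            }
        }
        return hits;
    }
}
```

The matcher receives files that do not match as well as files that do, which is why the point above turns on who chooses the matcher set rather than on what a matcher can read.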

@ottlinger

Contributor

@potiuk the Sonar build only runs on specific branches and specific PRs, as the credentials are not shared among all PRs/builds due to ASF restrictions.

@Claudenw

Contributor
  • could you drop the PR number here so I can cite it? Once it lands I'll flip XXE to a provided property.

#679 is the PR that does the XXE hardening.

Per Claudenw (PR apache#677): RAT lets operators define custom matcher classes that
see all scanned file text, but the matcher set is operator-defined config (not
attacker-supplied), so it's OUT-OF-MODEL: trusted-input — same posture as the
write mode.

Generated-by: Claude Opus 4.8 (1M context)
@potiuk

potiuk commented Jun 18, 2026

Member Author

Thanks Claudenw — folded the custom-matcher surface into §3: RAT lets operators define custom matcher classes that see the full text of every scanned file, but since the matcher set is operator-defined config (not attacker-supplied), a custom matcher reading scanned text is OUT-OF-MODEL: trusted-input — the same posture as the write mode. Marked maintainer-confirmed.

ottlinger — thanks for the CI clarification; that matches what we expected (the Sonar/CodeQL job is credential-gated to specific branches under ASF restrictions, so its red on this fork PR is environmental, not anything this docs-only change introduces).

Still one open item: the XXE-hardening PR number (§14 Q3) — I've left §8 #2 tentative pending it. Whenever you drop the number I'll cite it and flip XXE from "hardening in flight" to a provided property. No rush.

The remaining §14 questions (Q1 untrusted-input posture, Q2 no-network, Q5 Whisker/Tentacles profile, Q6 sibling pointer files) are still open whenever convenient — one line each is plenty.

@ottlinger

Contributor

@potiuk - thanks again:

  • Q5: as development on Tentacles/Whisker is rather low at the moment, I'd personally prefer to start with RAT and add Tentacles/Whisker later in order to reduce noise.

  • Q2: AFAIK RAT does not open network connections as it runs locally; except for downloading stuff via Maven/Ant/buildtools in order to build the artifacts locally.

Comment thread THREAT_MODEL.md
sometimes pointed at **untrusted input**: a CI job auditing an untrusted
contribution/PR, or auditing a downloaded third-party artifact. That is the
case the model cares about. *(inferred — Q1.)*

Contributor

This statement is correct

Comment thread THREAT_MODEL.md
caller invokes RAT (CLI/Ant/Maven) on a directory + a config
│ trusted invocation
read configuration (XMLConfigurationReader) ── XXE surface if config is untrusted

Contributor

#679 is the PR that does the XXE hardening. I don't know if that impacts here.

Comment thread THREAT_MODEL.md Outdated

- A JRE; RAT reads the filesystem it is pointed at and writes a report. It opens
**no network connections** and runs no services. *(inferred — Q2, the
no-network claim is high-value to confirm.)*

Contributor

True, no network connections are opened by RAT. RAT only opens files. One potential hole in this is XSLT transforms where the operator could add an xsl:include statement to open a connection to a remote system. This is out of scope as the XSLT are in the trusted space under control of the operator.
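
For operators who want to close that hole themselves, JAXP offers a standard switch. The sketch below is an assumption-laden illustration, not something RAT is stated to do: it relies on the JDK's built-in XSLT processor honouring `XMLConstants.ACCESS_EXTERNAL_STYLESHEET`, and the class name `LocalOnlyStylesheets` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper; not taken from RAT's sources. */
final class LocalOnlyStylesheets {

    /** Builds a factory that refuses to fetch external stylesheets or DTDs. */
    static TransformerFactory restrictedFactory() throws TransformerConfigurationException {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // An empty protocol list denies every protocol for xsl:include,
        // xsl:import and document() targets.
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        return factory;
    }

    /** Returns true when the stylesheet text compiles under the restrictions. */
    static boolean compiles(String stylesheet) {
        try {
            restrictedFactory().newTemplates(new StreamSource(new StringReader(stylesheet)));
            return true;
        } catch (TransformerConfigurationException e) {
            return false;
        }
    }
}
```

A stylesheet with no includes still compiles, while one containing `<xsl:include href="http://..."/>` fails to compile. Note that the empty list also blocks `file:` includes, so a real deployment that splits its stylesheets across files would need to allow that protocol.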

Comment thread THREAT_MODEL.md Outdated
**no network connections** and runs no services. *(inferred — Q2, the
no-network claim is high-value to confirm.)*
- The XML parser behaviour depends on the platform JAXP unless RAT configures it
(§5a/§8). *(inferred — Q3.)*

Contributor

It depends upon JAXP and can be configured through the JAXP properties, which can be set as system properties, as documented:
https://docs.oracle.com/javase/8/docs/technotes/guides/security/jaxp/jaxp.html#setting-jaxp-properties-as-system-properties
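
As a rough sketch of what that configurability means in practice, the JAXP 1.5 system properties can be set JVM-wide (on the command line with `-D`, or programmatically as below) and the platform parser picks them up. The property names come from the linked Oracle guide; the class `JaxpSystemProperties` and its methods are hypothetical, and the example assumes the JDK's built-in parser. The same mechanism can loosen restrictions as easily as tighten them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

/** Hypothetical helper showing JVM-wide JAXP restrictions. */
final class JaxpSystemProperties {

    /** Equivalent to passing -Djavax.xml.accessExternalDTD= and friends. */
    static void denyExternalAccess() {
        System.setProperty("javax.xml.accessExternalDTD", "");
        System.setProperty("javax.xml.accessExternalSchema", "");
        System.setProperty("javax.xml.accessExternalStylesheet", "");
    }

    /** Parses with a default, unconfigured factory; true when parsing succeeds. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

After `denyExternalAccess()`, a plain document still parses, while one that references an external DTD through `<!DOCTYPE ... SYSTEM "...">` does not.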

Comment thread THREAT_MODEL.md
(§8/§9, maintainer-confirmed). XML-parser DOCTYPE handling is being hardened via
a PMC PR (§14 Q3). There is no "insecure default toggle". *(maintainer / Q3
pending PR link.)*

Contributor

See above for PR link

Comment thread THREAT_MODEL.md Outdated
| Input | Attacker-controllable? (untrusted-run) | Concern |
| --- | --- | --- |
| scanned file content | **yes** | parsed/read; resource use |
| scanned file paths / archive entry names | **yes** | path handling on archive extraction |

Contributor

What does "path handling on archive extraction" mean? We do not extract the data into a directory. We read the files from the archive and extract them from there. The file paths are documented as relative to the archive so something like "/bar/baz.zip#/junk.txt" is reported for a file junk.txt in the archive baz.zip found in the /some/dir/bar/ directory on a unix/mac system where RAT was pointed to /some/dir as the tree to scan.

But the contents of junk.txt were only ever extracted to memory.
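
A small sketch of that labelling scheme, under stated assumptions: the `archive#/entry` format is modelled on the example in this comment, the JDK's `ZipInputStream` stands in for Commons Compress, and the class name `ArchiveEntryLabels` is hypothetical. It shows that an entry name is only concatenated into a report string and is never resolved against the filesystem.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper; the label format is an assumption based on the comment above. */
final class ArchiveEntryLabels {

    /** Returns one report label per file entry, for example "/bar/baz.zip#/junk.txt". */
    static List<String> labels(String archivePath, InputStream archive) throws IOException {
        List<String> labels = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // The name is used as text only. No File or Path is created
                // from it, so "../" segments have nothing to traverse.
                labels.add(archivePath + "#/" + entry.getName());
            }
        }
        return labels;
    }
}
```

An entry named `../escape.txt` simply yields the label `/bar/baz.zip#/../escape.txt`; nothing is written outside the report.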

Comment thread THREAT_MODEL.md Outdated
trusted caller; inputs are normally trusted, but the security-relevant case is
RAT auditing **untrusted** input (CI on untrusted PRs, third-party artifacts).
Is that the case you want modelled, or do you consider all RAT input trusted
(which would move XXE/archive items to `OUT-OF-MODEL: trusted-input`)? (§2/§7.)

Contributor

All RAT configuration items (XSLT stylesheets, configuration files, license definitions, matcher implementations) are trusted and under control of the operator.

The files that are read may be untrusted, as you point out, in the case of verification of PRs from 3rd parties.

Attack surface is anything that can break out of the scanning stream when the system is run with the default settings. I am certain that there are settings that could open the system up for attack, for example the JAXP environment vars.

Comment thread THREAT_MODEL.md Outdated
- **Q3.** *(Partially answered — PMC, PR #677: a hardening PR is in flight
ensuring DOCTYPE / external-entity handling is covered. **Pending the PR link
to cite**; once landed §8 #2 becomes a provided property.)* Does
`XMLConfigurationReader` disable DOCTYPE / external entities (XXE-safe)?

Contributor

External entities are disabled.
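
For readers unfamiliar with what "disabled" typically involves, the following is a generic sketch of per-factory XXE hardening with the JDK's built-in parser. It is not a copy of `XMLConfigurationReader` or of the change in #679; the class name `HardenedXmlParsing` is hypothetical, and the feature URIs are the standard Xerces/SAX ones, which another parser on the classpath might not accept.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical helper; a generic hardening recipe, not RAT's implementation. */
final class HardenedXmlParsing {

    static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Rejecting DOCTYPE outright removes both external and internal entity tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Belt and braces in case DOCTYPE ever has to be allowed again.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder();
    }

    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return newHardenedBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With this configuration an ordinary configuration document parses normally, and any document carrying a `<!DOCTYPE ...>` declaration is rejected with a `SAXException` before an entity can be resolved.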

Comment thread THREAT_MODEL.md Outdated
in-memory buffer (Commons Compress `ArchiveStreamFactory`) held until the
document is processed, so a crafted archive can OOM. Resolved as a §9 gap +
§10 responsibility; §8 #1 is **not** a provided property.)* Does
`ArchiveWalker` bound decompression (size/depth/entry-count)?

Contributor

We do not guard against OOM. We probably should add a limit, but at this time we do not. This is out of scope and not provided.

Comment thread THREAT_MODEL.md Outdated
**Wave 2 — surface.**

- **Q2.** Confirm RAT makes no network connections and has no side effects beyond
reading the scanned tree and writing the report. (§5/§8.)

Contributor

Confirmed above.

…d (external entities disabled, apache#679 hardens DOCTYPE), no-network confirmed (XSLT xsl:include caveat), correct archive path-handling (read to memory, no extract-to-disk → no path traversal), Whisker/Tentacles deferred
@potiuk

potiuk commented Jun 21, 2026

Member Author

Thanks @ottlinger and @Claudenw — that's everything we needed. I've folded all your answers into THREAT_MODEL.md (pushed). Summary of what changed:

  • Trust posture (Q1): confirmed and now marked (maintainer). RAT config (XSLT, config files, license defs, custom matchers) is operator-trusted; the scanned files are the untrusted surface. §2/§7.
  • No network (Q2): confirmed (maintainer). Added your XSLT xsl:include nuance: the one operator-reachable way out is a remote xsl:include, and since stylesheets are trusted config that's OUT-OF-MODEL. §5/§8 #3.
  • XXE (Q3): external entities are disabled, so §8 #2 is now a provided property (was tentative). Noted PR #679 (RAT-560: changes to reduce XXE exposure) as the DOCTYPE-hardening follow-up, along with the JAXP-system-properties configurability. §5/§5a/§8.
  • Archive bound (Q4): kept as a disclaimed §9 gap (no bound, OOM not guarded).
  • Path handling: corrected a phantom risk. Since RAT reads entries into memory and never extracts to disk, there's no zip-slip / path-traversal-on-write surface. An entry label like bar/baz.zip#/junk.txt is just a report string. §6/§9.
  • Whisker/Tentacles (Q5/Q6): scoped this PR to creadur-rat per your preference; the sibling pointer files are a deferred follow-up.

With every §14 question answered, the model is ready to ratify whenever the PMC's happy with it. (The red check is the CodeQL "Build and analyze" job, which is unrelated to these doc-only changes — all 13 build/test matrix jobs pass.)

3 participants