Skip to content

Harden daemon IPC/transport correctness and DoS sizing#314

Merged
TeoSlayer merged 1 commit into
mainfrom
fix/daemon-correctness-dos-robustness
Jun 22, 2026
Merged

Harden daemon IPC/transport correctness and DoS sizing#314
TeoSlayer merged 1 commit into
mainfrom
fix/daemon-correctness-dos-robustness

Conversation

@TeoSlayer

Copy link
Copy Markdown
Collaborator

Fixes eight daemon-layer correctness/DoS/robustness issues in pkg/daemon/{ipc.go,daemon.go,tunnel.go} (+ routing). Each fix has a test; full pkg/daemon/... suite is green under -race -parallel 4.

Fixes

1. CmdCancel was a no-op (ipc.go).
The read loop did if cmd == CmdCancel { continue } while comments claimed it cancelled an in-flight request by reqID — but there is no reqID on the wire. Implemented real cancellation: CmdCancel now calls conn.cancelAllDials(), firing every in-flight dial cancel for that IPC connection without tearing the connection down, so a driver that timed out can stop the daemon grinding the 14–31 s retry budget. A per-request cancel is impossible without a wire change (no correlation token), so it necessarily cancels all of the client's in-flight dials — which matches the cmd's intent (a driver only sends it after giving up). Close() now shares the same helper. All misleading "cmdCancel" comments in the dial handler reconciled with reality.

2. Misleading request-ID docs (ipc.go).
The wire-format doc claimed [uint32-len][uint8-cmd][uint64-reqID][payload] with reqID demux; the real frame is [uint32-len][uint8-cmd][payload] (IPCEnvelopeHeaderSize == 1). Rewrote the doc to match, noted the threaded reqID param is always 0 and not transmitted, and added a TODO(ipc-correlation) describing a backward-compatible token scheme. No wire change (the client-side mitigation lives in common).

3. UDP-blocked nets not auto-switching to compat (daemon.go + routing).
probeUDPReachable sent a junk byte and returned true on timeout, so silently-blackholed UDP (the common UDP-blocked corporate case, which sends no ICMP reject) was treated as reachable and compat fallback never fired. Rewrote it to send a real BeaconMsgDiscover (via routing.ProbeBeaconRTT) and assert reachability only on a valid reply within a 1.5 s window; no reply = no evidence of UDP = not reachable. This won't false-positive on quiet-but-working links because the beacon always answers a discover on a functioning path (it's the same cold-start STUN exchange the daemon already depends on). Gated the auto-switch on a configured compat beacon so an unblocked-but-misconfigured daemon isn't stranded.

4. Conn-limit comment vs reality (daemon.go). DECISION: keep 65536.
Comment said default 4096; actual DefaultMaxTotalConnections = 65536. Kept 65536 — it is intentional for registry-scale daemons (tens of thousands of identities) and is not the first DoS line: MaxConnsPerIPCClient (4096) and MaxIPCClients (1024) gate per-client/per-socket growth and MaxConnectionsPerPort (1024) bounds inbound fan-in, so no single client can reach it. Fixed the comment and documented the DoS sizing rationale (lower MaxTotalConnections in Config on memory-constrained hosts).

5. Network-ID uint16 casts without bounds (daemon.go).
Registry/bus network ids were cast uint16(float64) without validation, so e.g. id: 65537 wrapped to 1 and could remap policy/rules onto another network. Added networkIDFromJSON / networkIDInRange (rejects NaN, negative, fractional, out-of-range) and applied at every site: rules load, expr-policy load, nodeNetworksFresh, and the join-event listByID map. networkIDFromBusPayload now range-checks int/int64/float64 forms.

6. health not purely local (ipc.go + daemon.go).
handleHealth called Info()nodeNetworks()regConn.Lookup, so registry latency degraded the local health probe. Added Daemon.HealthSnapshot() that reads only in-memory state (no registry) and switched the handler to it.

7. Stream send errors dropped (ipc.go).
handleSend/handleSendTo discarded SendData/SendDatagram errors with _ =. These stay fire-and-forget on the wire (an error reply would corrupt the driver's pending channel), but the errors are now logged so silent stream/datagram send failures are observable.

8. Pending key-exchange queue dropped oldest (tunnel.go). DECISION: drop newest.
flushPending replays FIFO, so the oldest queued frames are the connection-setup prefix (SYN, first app bytes). Dropping the head strands the receiver's in-order reassembly; dropping the tail is recovered by the transport's retransmit (dup-ack/RTO). Switched the full-queue policy to drop the newest (reject the incoming packet) to preserve the ordered prefix. Caller still gets the typed ErrPendingDropped; PendingDrops accounting and log throttling unchanged. Updated the error message and all related comments.

Deferrals

  • test: add e2e testing, coverage and pre-commits #2 correlation token: left as a documented TODO (wire change; common handles the client-side mitigation).
  • Port-list uint16 casts (allowed_ports, daemon.go ~1790): left as-is — ports are 16-bit by spec, out of this item's network-ID scope.
  • Registry plaintext default (daemon.go ~910): untouched per instructions (separate TLS-rollout design).

Validation

  • GOWORK=off go build ./... clean; go vet ./pkg/daemon/... clean; gofmt clean.
  • GOWORK=off go test -race -parallel 4 ./pkg/daemon/... green (the one macOS failure, TestWriteLoopExitsOnWriteDeadline, is the known unix-socket-path-length temp flake — passes with TMPDIR=/tmp; Linux is the gate).
  • gosec: no new findings on touched lines; gitleaks: no leaks.

🤖 Generated with Claude Code

- CmdCancel: implement real cancellation via conn.cancelAllDials (no-op
  before); abort the client's in-flight dials without closing the conn.
  Reconcile IPC wire-format docs with reality (single cmd byte, no reqID).
- probeUDPReachable: assert reachability only on a real beacon discover
  reply; treat no-reply (silent UDP blackhole) as not reachable so
  UDP-blocked nets auto-switch to compat. Gate the switch on a configured
  compat beacon to avoid stranding the daemon.
- MaxTotalConnections: fix comment (65536, not 4096) and document the DoS
  sizing and why the per-client/per-port caps are the real defence.
- Validate registry network ids before the uint16 cast at all sites so a
  hostile/malformed payload cannot wrap and remap policy/network identity.
- Make health a purely-local check via HealthSnapshot; registry latency no
  longer degrades or stalls the local health probe.
- Log dropped IPC stream/datagram send errors instead of swallowing them.
- Pending key-exchange queue: drop NEWEST when full to preserve the
  ordered setup prefix (first bytes); retransmit recovers dropped tail.

Add tests covering each fix.
@TeoSlayer TeoSlayer merged commit f0c422f into main Jun 22, 2026
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants