Harden daemon IPC/transport correctness and DoS sizing#314
Merged
Conversation
- CmdCancel: implement real cancellation via conn.cancelAllDials (no-op before); abort the client's in-flight dials without closing the conn. Reconcile IPC wire-format docs with reality (single cmd byte, no reqID). - probeUDPReachable: assert reachability only on a real beacon discover reply; treat no-reply (silent UDP blackhole) as not reachable so UDP-blocked nets auto-switch to compat. Gate the switch on a configured compat beacon to avoid stranding the daemon. - MaxTotalConnections: fix comment (65536, not 4096) and document the DoS sizing and why the per-client/per-port caps are the real defence. - Validate registry network ids before the uint16 cast at all sites so a hostile/malformed payload cannot wrap and remap policy/network identity. - Make health a purely-local check via HealthSnapshot; registry latency no longer degrades or stalls the local health probe. - Log dropped IPC stream/datagram send errors instead of swallowing them. - Pending key-exchange queue: drop NEWEST when full to preserve the ordered setup prefix (first bytes); retransmit recovers dropped tail. Add tests covering each fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes eight daemon-layer correctness/DoS/robustness issues in
pkg/daemon/{ipc.go,daemon.go,tunnel.go}(+ routing). Each fix has a test; fullpkg/daemon/...suite is green under-race -parallel 4.Fixes
1. CmdCancel was a no-op (ipc.go).
The read loop did
if cmd == CmdCancel { continue }while comments claimed it cancelled an in-flight request by reqID — but there is no reqID on the wire. Implemented real cancellation:CmdCancelnow callsconn.cancelAllDials(), firing every in-flight dial cancel for that IPC connection without tearing the connection down, so a driver that timed out can stop the daemon grinding the 14–31 s retry budget. A per-request cancel is impossible without a wire change (no correlation token), so it necessarily cancels all of the client's in-flight dials — which matches the cmd's intent (a driver only sends it after giving up).Close()now shares the same helper. All misleading "cmdCancel" comments in the dial handler reconciled with reality.2. Misleading request-ID docs (ipc.go).
The wire-format doc claimed
[uint32-len][uint8-cmd][uint64-reqID][payload]with reqID demux; the real frame is[uint32-len][uint8-cmd][payload](IPCEnvelopeHeaderSize == 1). Rewrote the doc to match, noted the threadedreqIDparam is always 0 and not transmitted, and added aTODO(ipc-correlation)describing a backward-compatible token scheme. No wire change (the client-side mitigation lives incommon).3. UDP-blocked nets not auto-switching to compat (daemon.go + routing).
probeUDPReachablesent a junk byte and returnedtrueon timeout, so silently-blackholed UDP (the common UDP-blocked corporate case, which sends no ICMP reject) was treated as reachable and compat fallback never fired. Rewrote it to send a realBeaconMsgDiscover(viarouting.ProbeBeaconRTT) and assert reachability only on a valid reply within a 1.5 s window; no reply = no evidence of UDP = not reachable. This won't false-positive on quiet-but-working links because the beacon always answers a discover on a functioning path (it's the same cold-start STUN exchange the daemon already depends on). Gated the auto-switch on a configured compat beacon so an unblocked-but-misconfigured daemon isn't stranded.4. Conn-limit comment vs reality (daemon.go). DECISION: keep 65536.
Comment said
default 4096; actualDefaultMaxTotalConnections = 65536. Kept 65536 — it is intentional for registry-scale daemons (tens of thousands of identities) and is not the first DoS line:MaxConnsPerIPCClient(4096) andMaxIPCClients(1024) gate per-client/per-socket growth andMaxConnectionsPerPort(1024) bounds inbound fan-in, so no single client can reach it. Fixed the comment and documented the DoS sizing rationale (lowerMaxTotalConnectionsin Config on memory-constrained hosts).5. Network-ID uint16 casts without bounds (daemon.go).
Registry/bus network ids were cast
uint16(float64)without validation, so e.g.id: 65537wrapped to1and could remap policy/rules onto another network. AddednetworkIDFromJSON/networkIDInRange(rejects NaN, negative, fractional, out-of-range) and applied at every site: rules load, expr-policy load,nodeNetworksFresh, and the join-eventlistByIDmap.networkIDFromBusPayloadnow range-checks int/int64/float64 forms.6. health not purely local (ipc.go + daemon.go).
handleHealthcalledInfo()→nodeNetworks()→regConn.Lookup, so registry latency degraded the local health probe. AddedDaemon.HealthSnapshot()that reads only in-memory state (no registry) and switched the handler to it.7. Stream send errors dropped (ipc.go).
handleSend/handleSendTodiscardedSendData/SendDatagramerrors with_ =. These stay fire-and-forget on the wire (an error reply would corrupt the driver's pending channel), but the errors are now logged so silent stream/datagram send failures are observable.8. Pending key-exchange queue dropped oldest (tunnel.go). DECISION: drop newest.
flushPendingreplays FIFO, so the oldest queued frames are the connection-setup prefix (SYN, first app bytes). Dropping the head strands the receiver's in-order reassembly; dropping the tail is recovered by the transport's retransmit (dup-ack/RTO). Switched the full-queue policy to drop the newest (reject the incoming packet) to preserve the ordered prefix. Caller still gets the typedErrPendingDropped;PendingDropsaccounting and log throttling unchanged. Updated the error message and all related comments.Deferrals
commonhandles the client-side mitigation).allowed_ports, daemon.go ~1790): left as-is — ports are 16-bit by spec, out of this item's network-ID scope.Validation
GOWORK=off go build ./...clean;go vet ./pkg/daemon/...clean;gofmtclean.GOWORK=off go test -race -parallel 4 ./pkg/daemon/...green (the one macOS failure,TestWriteLoopExitsOnWriteDeadline, is the known unix-socket-path-length temp flake — passes withTMPDIR=/tmp; Linux is the gate).🤖 Generated with Claude Code