Skip to content

Remove bootstrap nodes and replace with operator peers#3909

Open
lrsaturnino wants to merge 25 commits intomainfrom
feature/decouple-firewall-allowlist
Open

Remove bootstrap nodes and replace with operator peers#3909
lrsaturnino wants to merge 25 commits intomainfrom
feature/decouple-firewall-allowlist

Conversation

@lrsaturnino
Copy link
Copy Markdown
Member

@lrsaturnino lrsaturnino commented Mar 25, 2026

Problem

The network relies on centrally-managed bootstrap nodes for initial peer discovery. Their embedded public keys bypass firewall IsRecognized() staking checks via the AllowList, meaning an unstaked or slashed embedded peer retains permanent network access. Operators hardcoding bootstrap addresses in --network.peers will lose connectivity when bootstrap infrastructure is decommissioned.

Solution

  • Replace bootstrap node entries with operator peers (/dns4/ or /ip4/ format)
  • Decouple firewall AllowList from discovery peers — pass firewall.EmptyAllowList so all peers pass IsRecognized() staking checks
  • Remove dead ExtractPeersPublicKeys function
  • Deprecate --network.bootstrap flag with runtime warning
  • Rename connected_bootstrap_count metric to connected_wellknown_peers_count

Tests

  • TestValidate_EmptyAllowList_RecognizedPeerAccepted — recognized peer passes via IsRecognized path
  • TestValidate_EmptyAllowList_UnrecognizedPeerRejected — unrecognized peer rejected with no AllowList bypass
  • TestValidate_EmptyAllowList_PreviouslyAllowlistedPeerMustPassIsRecognized — previously allowlisted peer no longer bypasses checks
  • TestResolvePeers — updated expectations for new operator peer entries (mainnet and testnet)
  • TestNetworkBootstrapFlagDescription_ContainsDeprecationNotice — flag description includes deprecation text
  • TestIsBootstrap — returns correct boolean value
  • TestConnectedWellknownPeersCountMetricName — metric constant has correct value
  • TestObserveConnectedWellknownPeersCount_Callable — renamed function exists and executes without panic

Replace all Boar bootstrap node entries with curated DNS-backed operator
peers to complete the bootstrap infrastructure decommission. This is the
final phase — Staked nodes were removed in v2.5.1.

Peer list changes:
- Mainnet: 2 Boar entries replaced with 5 operator peers at keep-nodes.io
- Testnet: 1 Boar entry replaced with 2 operator peers at test.keep-nodes.io
- All entries use /dns4/ format for IP change resilience

Security — AllowList decoupling:
- Pass firewall.EmptyAllowList instead of extracting embedded peer keys
- All peers (including embedded operators) now pass IsRecognized() staking
  checks with no firewall bypass
- Remove dead ExtractPeersPublicKeys function and its tests

Deprecations and renames:
- Deprecate --network.bootstrap flag with runtime warning
- Rename connected_bootstrap_count metric to connected_wellknown_peers_count

Documentation:
- Add operator migration guide covering Boar address removal,
  --network.peers override behavior, and monitoring updates
@lrsaturnino lrsaturnino changed the title Remove Boar bootstrap nodes and replace with operator peers Remove bootstrap nodes and replace with operator peers Mar 25, 2026
Operator migration guidance will be distributed separately from the
code release.
Replace placeholder mainnet entries with the beta staker node
(143.198.18.229:3919) as the sole embedded peer for initial testing.
Testnet placeholders remain until operator coordination is complete.
@lrsaturnino lrsaturnino marked this pull request as ready for review March 25, 2026 04:53
gotestsum's default `./...` package pattern only applies when no args
are passed after `--`. With `-- -timeout 15m`, it forwards args directly
to `go test`, which defaults to `.` (root package only). The root
package has no test files, so CI has been silently running 0 tests.
Replace placeholder hostnames with actual values from config/_peers/testnet.
Remove TestConnectedWellknownPeersCountMetricName and TestMetricConstants
which only assert that string constants equal themselves. The compiler
already ensures rename safety. Keep the callable integration test which
validates the function exists and executes without panicking.
…d var

Change EmptyAllowList from an exported mutable package-level var to an
exported function returning the package-level singleton. This prevents
external code from accidentally mutating the shared empty allowlist.
Add a note that connected_wellknown_peers_count was previously named
connected_bootstrap_count, so operators can update Prometheus queries
and Grafana dashboards accordingly.
Specify concrete removal version so the deprecated flag does not linger
indefinitely.
Add a TODO comment noting that at least one additional mainnet peer
across a different operator/ASN should be added before production
rollout to avoid a single point of failure for initial peer discovery.
Go 1.24 vet rejects non-constant format strings in fmt.Errorf. This
pre-existing issue was hidden because CI was not running tests.
lionakhnazarov
lionakhnazarov previously approved these changes Mar 25, 2026
Copy link
Copy Markdown
Collaborator

@lionakhnazarov lionakhnazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

piotr-roslaniec and others added 2 commits March 26, 2026 09:40
## Summary

- CI has been silently running **0 tests** because `gotestsum --
-timeout 15m` (without `./...`) only tests the root package, which has
no test files
- Fix: add explicit `./...` so all subpackages are tested
- Also fix `peers_test.go` placeholder hostnames to match actual
`config/_peers/testnet` values — this test failure was hidden by the
above bug

## Root cause

`gotestsum`'s default `./...` package pattern only applies when **no
args** are passed after `--`. With `-- -timeout 15m`, gotestsum forwards
args directly to `go test`, which defaults to `.` (current directory
only). CI log confirms: `DONE 0 tests in 8.107s`.

## Test plan

- [ ] CI should now run all Go tests (expect 100+ tests instead of 0)
- [ ] `TestResolvePeers/sepolia_network` should pass with corrected
hostnames


🤖 Generated with [Claude Code](https://claude.com/claude-code)
piotr-roslaniec and others added 10 commits April 2, 2026 07:57
gotestsum's default `./...` package pattern only applies when no args
are passed after `--`. With `-- -timeout 15m`, it forwards args directly
to `go test`, which defaults to `.` (root package only). The root
package has no test files, so CI has been silently running 0 tests.
Replace placeholder hostnames with actual values from config/_peers/testnet.
Remove TestConnectedWellknownPeersCountMetricName and TestMetricConstants
which only assert that string constants equal themselves. The compiler
already ensures rename safety. Keep the callable integration test which
validates the function exists and executes without panicking.
…d var

Change EmptyAllowList from an exported mutable package-level var to an
exported function returning the package-level singleton. This prevents
external code from accidentally mutating the shared empty allowlist.
Add a note that connected_wellknown_peers_count was previously named
connected_bootstrap_count, so operators can update Prometheus queries
and Grafana dashboards accordingly.
Specify concrete removal version so the deprecated flag does not linger
indefinitely.
Add a TODO comment noting that at least one additional mainnet peer
across a different operator/ASN should be added before production
rollout to avoid a single point of failure for initial peer discovery.
Go 1.24 vet rejects non-constant format strings in fmt.Errorf. This
pre-existing issue was hidden because CI was not running tests.
## Summary

- CI has been silently running **0 tests** because `gotestsum --
-timeout 15m` (without `./...`) only tests the root package, which has
no test files
- Fix: add explicit `./...` so all subpackages are tested
- Also fix `peers_test.go` placeholder hostnames to match actual
`config/_peers/testnet` values — this test failure was hidden by the
above bug

## Root cause

`gotestsum`'s default `./...` package pattern only applies when **no
args** are passed after `--`. With `-- -timeout 15m`, gotestsum forwards
args directly to `go test`, which defaults to `.` (current directory
only). CI log confirms: `DONE 0 tests in 8.107s`.

## Test plan

- [ ] CI should now run all Go tests (expect 100+ tests instead of 0)
- [ ] `TestResolvePeers/sepolia_network` should pass with corrected
hostnames


🤖 Generated with [Claude Code](https://claude.com/claude-code)
@piotr-roslaniec piotr-roslaniec force-pushed the feature/decouple-firewall-allowlist branch from 16ace1b to 5a6bb93 Compare April 2, 2026 06:04
piotr-roslaniec
piotr-roslaniec previously approved these changes Apr 2, 2026
The count() loop could call close(watcher.channel) on consecutive ticks
before the WatchBlocks cleanup goroutine removed the cancelled watcher
from the list, causing a "close of closed channel" panic. Use sync.Once
to guarantee the channel is closed exactly once.
@pdyraga
Copy link
Copy Markdown
Member

pdyraga commented Apr 7, 2026

I strongly advise against this change.

Bootstrap nodes in the tBTC client are not just an address book. They work in a relay mode, improving the dissemination of network messages. Given the specificity of the tECDSA algorithm used by tBTC nodes, the delivery of messages on time is essential, and messages are dropped by nodes when CPU load is high executing tECDSA steps. This is how libp2p floodsub works, and this is how we designed message delivery buffers in the client. Another symptom that is clearly visible in metrics is the number of connections being dropped when executing CPU-intensive tECDSA steps.

Bootstrap nodes work on a high-availability infrastructure, do not participate in tECDSA, and provide a stable entrance point to the network. If several operator nodes lose their in-memory address books for any reason (could be, for example, a high CPU load leading to a restart or panic in the algorithm), they will find HA bootstrap immediately and recover the peer list in a matter of seconds.

Eliminating bootstrap nodes should be preceded by changing the network layer design, probably moving to gossip mode, and performing load tests on a larger network resembling mainnet conditions. We made an attempt at this architecture change in the past, and it was not a trivial endeavour.

@nkuba
Copy link
Copy Markdown
Member

nkuba commented Apr 7, 2026

I agree with @pdyraga here.

The current network architecture has been battle-tested over a long period of time. It went through months of iterative testing on testnet and then further validation in production. This setup is not accidental, it reflects a lot of learnings around reliability and behavior under load.

Bootstrap nodes seem to play a role beyond just peer discovery, especially when it comes to message propagation and overall network stability during heavy tECDSA workloads. Removing them without a proven alternative introduces a real risk.

I strongly recommend against changing such a critical part of the system without thorough validation. At minimum, this should be backed by extensive testing and observation in controlled environments and then gradual rollout with production monitoring.

It’s also worth noting that the testnet is no longer maintained and has effectively been shut down, which makes it even harder to properly validate changes like this before exposing them to mainnet conditions.

In short, I’m in favor of the security improvements, but removing bootstrap infrastructure at this stage feels premature without stronger evidence and testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants