Skip to content

consensus: rejoin a lagging archive by syncing proposals from peers#570

Merged
CassOnMars merged 1 commit into
QuilibriumNetwork:v2.1.0.24from
dazthecorgi:fix-archive-consensus-catch-up
Jun 18, 2026
Merged

consensus: rejoin a lagging archive by syncing proposals from peers#570
CassOnMars merged 1 commit into
QuilibriumNetwork:v2.1.0.24from
dazthecorgi:fix-archive-consensus-catch-up

Conversation

@dazthecorgi

Copy link
Copy Markdown
Contributor

A global-consensus participant that falls behind — e.g. an archive isolated by a network partition — could not rejoin consensus. Orphan resolution and finalized-rank advancement happen only on the consensus message path (on_receive_proposal -> forks.add_validated_state -> drain orphans); a plain frame-store write never triggers them, and freshness gates drop a lagging node's traffic until its finalized rank advances. With no path to backfill the missed proposals, the node would orphan every later proposal forever — its frame store could catch up (via the read-only poller) while its consensus engine stayed permanently stuck.

This ports the Go node's catch-up path (SyncProvider / GetGlobalProposal -> AddProposal): when the engine orphans a proposal for a missing parent, pull the missing proposals from a peer and submit them into the consensus loop so the node finalizes them and resumes voting — rather than merely mirroring frames.

Changes:

  • Serve full proposals: implement GlobalService.GetGlobalProposal, assembling state + parent QC + prior TC + proposer vote from the clock store (FrameLookup::get_global_proposal). Previously stubbed to return nothing for non-genesis frames.
  • Persist the proposer vote at proposal ingest (keyed filter,rank,selector) so it can be served back; the store trait had the accessor but no producer wrote it. Add ProposalVote/TimeoutCertificate proto<->wire conversions and proto_proposal_to_signed (only QC had one).
  • Trigger: add Consumer::on_missing_parent, fired at the orphan-cache site, surfaced through GlobalConsumer and a required SyncTriggerHook on ConsensusActivationParams — required (not optional) so the recovery path cannot be silently left unwired.
  • Catch-up task (node): a Notify-driven task pulls proposals via ArchiveClient::get_global_proposal ascending from the engine's finalized frame (tracked separately from the poller-shared head) and submits them (submit_quorum_certificate / submit_timeout_certificate / submit_proposal).

Verification: the devnet rank-1 partition scenario (archive-4 isolated) now passes with the victim genuinely rejoining consensus — it orphans during the partition, then on heal the catch-up supplies the one missing parent frame and the engine finalizes frames 1..4 (state finalized, persisted candidate frames), orphaning stops, and 4/4 nodes reach the stop frame with the chain-safety check clean. Before this change the victim stayed at finalized=0, orphaning every rank.

A global-consensus participant that falls behind — e.g. an archive isolated by
a network partition — could not rejoin consensus. Orphan resolution and
finalized-rank advancement happen only on the consensus message path
(on_receive_proposal -> forks.add_validated_state -> drain orphans); a plain
frame-store write never triggers them, and freshness gates drop a lagging
node's traffic until its finalized rank advances. With no path to backfill the
missed proposals, the node would orphan every later proposal forever — its
frame store could catch up (via the read-only poller) while its consensus
engine stayed permanently stuck.

This ports the Go node's catch-up path (SyncProvider / GetGlobalProposal ->
AddProposal): when the engine orphans a proposal for a missing parent, pull the
missing proposals from a peer and submit them into the consensus loop so the
node finalizes them and resumes voting — rather than merely mirroring frames.

Changes:
- Serve full proposals: implement GlobalService.GetGlobalProposal, assembling
  state + parent QC + prior TC + proposer vote from the clock store
  (FrameLookup::get_global_proposal). Previously stubbed to return nothing for
  non-genesis frames.
- Persist the proposer vote at proposal ingest (keyed filter,rank,selector) so
  it can be served back; the store trait had the accessor but no producer wrote
  it. Add ProposalVote/TimeoutCertificate proto<->wire conversions and
  proto_proposal_to_signed (only QC had one).
- Trigger: add Consumer::on_missing_parent, fired at the orphan-cache site,
  surfaced through GlobalConsumer and a required SyncTriggerHook on
  ConsensusActivationParams — required (not optional) so the recovery path
  cannot be silently left unwired.
- Catch-up task (node): a Notify-driven task pulls proposals via
  ArchiveClient::get_global_proposal ascending from the engine's finalized
  frame (tracked separately from the poller-shared head) and submits them
  (submit_quorum_certificate / submit_timeout_certificate / submit_proposal).

Verification: the devnet rank-1 partition scenario (archive-4 isolated) now
passes with the victim genuinely rejoining consensus — it orphans during the
partition, then on heal the catch-up supplies the one missing parent frame and
the engine finalizes frames 1..4 (state finalized, persisted candidate frames),
orphaning stops, and 4/4 nodes reach the stop frame with the chain-safety check
clean. Before this change the victim stayed at finalized=0, orphaning every
rank.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@CassOnMars CassOnMars merged commit f7753c3 into QuilibriumNetwork:v2.1.0.24 Jun 18, 2026
5 checks passed
CassOnMars added a commit that referenced this pull request Jun 29, 2026
* current state of v2.1.0.24

* add more test coverage, fix patch number sync with config, fix race where initial failout of sync dooms workers to idle forever (until reboot)

* fix tests (#566)

* fix test compilation

* fix keypair_from_protobuf_encoding test

* re-enable ed448-rust tests

* enable pacemaker_advances_on_qc test

* fix transaction safety for hypergraph store writes (#567)

* fix vertex data roundtrip test and add new one

* fix: use tx when saving vertex data

Step 1 — Failing regression test (TDD)
Added get_vertex_data_not_visible_when_commit_txn_aborted to
crates/quil-rpc/tests/vertex_data_end_to_end.rs. It drives the real
lazy-commit path into a transaction, aborts it, and asserts GetVertexData
returns nothing. Confirmed it failed on current code before any fix.

Step 2 — save_vertex_underlying is now transaction-aware
- crates/quil-types/src/store.rs: added txn: &dyn Transaction to the trait
  method.
- crates/quil-store/src/hypergraph.rs: added inherent
  save_vertex_underlying_txn / save_tree_blob_txn using the existing
  with_rocks_batch pattern (stage into the batch, fall back to a direct
  write for unrecognized txns); the trait impl delegates to the txn
  variant. Kept the original inherent direct-write methods intact so the
  ~25 concrete-typed callers are untouched.

Step 3 — Threaded txn through the lazy-tree commit
crates/quil-tries/src/lazy_tree.rs: walk_leaves_persist now takes txn,
passes it recursively and into save_vertex_underlying; the call site passes
the txn that commit already receives. Leaf writes now join the same batch
as the tree nodes and shard commit.

Step 4 — Sync-apply path made atomic
crates/quil-rpc/src/hypergraph_sync_probe.rs: extracted a
persist_shard_refresh helper that wraps the tree-blob write and the
per-vertex loop in a single transaction (new_transaction → writes →
commit); on any error the txn is dropped, so partial syncs don't persist.
Used in both the fresh and incremental paths.

Step 5 — Implementors updated & verified
Added the _txn param to the four other HypergraphStore impls (BTreeStore,
MemStore, E2EHgStore, the quil-execution test store).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: make LazyVectorCommitmentTree::commit retry-safe

Make LazyVectorCommitmentTree::commit retry-safe by deferring dirty-state
clearing until the surrounding transaction is durably committed.

Problem: commit drained its dirty-node map and pending_deletions and
cleared dirty_flag before returning — i.e. before the caller's
txn.commit(). If the txn was aborted/failed/never committed, the store
stayed unchanged while the in-memory tree (which is reused across frames)
believed it had persisted, so a retry silently skipped re-staging the
missing writes.

Fix:
- crates/quil-tries/src/lazy_tree.rs: commit now takes a non-draining
  snapshot of the dirty set and pending_deletions to stage into the txn,
  and no longer clears dirty_flag. Removed the dead latest map. Added
  mark_persisted() — the sole place dirty bookkeeping is cleared; must be
  called only after txn.commit() succeeds. Idempotent.
- crates/quil-hypergraph/src/crdt.rs: HypergraphCRDT::commit collects the
  trees it staged and calls mark_persisted() on each only after
  txn.commit()? succeeds. A failed commit returns early, leaving every
  tree dirty for a safe retry.
- Side benefit: compute_shard_root (which commits under a NoopTxn) no
  longer clears dirty state, so the subsequent real commit re-stages those
  nodes transactionally.

Tests:
- crates/quil-tries/tests/golden_lazy_tree.rs — added
  commit_defers_dirty_clear_until_mark_persisted: proves the tree stays
  dirty after commit, that a retry after a discarded txn re-stages node
  writes, and that a clean tree (post mark_persisted) stages nothing.
  Added kv_len/clear_kv test helpers.
- crates/quil-rpc/tests/vertex_data_end_to_end.rs — folded the is_dirty()
  lifecycle into the existing real-RocksDB commit-path tests:
  - success path now asserts dirty after txn.commit(), clean after
    mark_persisted();
  - abort path now asserts the tree stays dirty (retry-safe).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: make compute_shard_root read-only

compute_shard_root claimed to read "without writing anything to the store"
but computed its root via tree.commit(&NoopTxn, ...). Since NoopTxn isn't a
RocksTxn, every write inside commit (tree nodes, deletions, and vertex
underlying blobs) fell through to a direct db.put, persisting data outside
any frame transaction for an uncommitted frame. This runs on the hot
master-stream receive path (maybe_sync_before_global_frame), so a routine
root comparison leaked writes to disk.

- quil-tries: extract LazyVectorCommitmentTree::compute_root(prover), the
  read-only half of commit — recomputes commitments in memory, returns the
  root, writes nothing and touches no dirty/deletion/dirty-flag state.
  Refactor commit to delegate to it.
- quil-hypergraph: compute_shard_root now calls tree.compute_root(); deleted
  the inline NoopTxn shim and the doubled doc comment.
- Add quil-hypergraph test compute_shard_root_is_read_only (with MemStore
  node_count/per_vertex_count accessors and a dirty-state helper): asserts a
  real root is returned, nothing is persisted, the tree stays dirty, and the
  root matches what a real commit then produces.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: require RocksTxn for hypergraph store writes

With compute_shard_root no longer passing a NoopTxn, no caller ever hands a
non-RocksTxn to a RocksHypergraphStore write, so the silent direct-write
fallback only served to mask bugs like the one compute_shard_root hit.

Remove the with_rocks_batch helper. Replace it with
RocksTxn::from_dyn(&dyn Transaction) -> Result<&RocksTxn>, which downcasts
once and errors loudly on an unrecognized txn instead of writing outside the
transaction. Rewrite all writers (save_tree_blob_txn, save_vertex_underlying_txn,
insert_node, save_root, delete_node, set_shard_commit, set_alt_shard_commit,
track_change, untrack_change) to operate on the concrete RocksTxn batch
directly. The HypergraphStore trait signatures stay &dyn Transaction for
object-safety (five implementors, used as Arc<dyn HypergraphStore>).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* handle leaving scenario with store wipe

* reduce score differential basis for flagging leave-to-join opportunities, extend scoring-based leave window to be a full cycle

* adjust margins on decisions

* adjust threshold for decides and joins

* adjust snapshotting to use actual rocksdb snapshots

* resolve unsynced leave issuance condition

* reapply docker build optimizations to Dockerfile.source (#568)

Commit 1bdc940 ("current state of v2.1.0.24") accidentally reverted the
build optimizations introduced in PR #551 while landing unrelated changes
(ubuntu:22.04 base, upstream protoc, the build-migrate-tool RocksDB stage,
qns-api removal). This restores the optimizations on top of those changes:

- Consolidate the seven parallel gen-* stages (each running generate.sh)
  into a single gen-rust stage: one cargo build for all leaf crates so
  shared workspace deps compile once instead of seven times. Artifacts are
  stashed in /out/ since target/ is a cache mount, and build-context now
  COPYs --from=gen-rust accordingly.
- Restore cargo cache mounts (cargo-registry, cargo-git, cargo-target-node)
  on build-node.
- Restore Go cache mounts (go-build, go-mod) on build-qclient so a
  build-context invalidation no longer forces a from-scratch Go rebuild.
- Extend the same Go cache mounts to the new build-migrate-tool stage for
  consistency.
- Restore the explanatory comments on COPY crates and build-qclient.

This brings Dockerfile.source back in line with Dockerfile.sourceavx512,
which was untouched by the revert and shares the cargo/go cache ids.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* consensus: rejoin a lagging archive by syncing proposals from peers (#570)

A global-consensus participant that falls behind — e.g. an archive isolated by
a network partition — could not rejoin consensus. Orphan resolution and
finalized-rank advancement happen only on the consensus message path
(on_receive_proposal -> forks.add_validated_state -> drain orphans); a plain
frame-store write never triggers them, and freshness gates drop a lagging
node's traffic until its finalized rank advances. With no path to backfill the
missed proposals, the node would orphan every later proposal forever — its
frame store could catch up (via the read-only poller) while its consensus
engine stayed permanently stuck.

This ports the Go node's catch-up path (SyncProvider / GetGlobalProposal ->
AddProposal): when the engine orphans a proposal for a missing parent, pull the
missing proposals from a peer and submit them into the consensus loop so the
node finalizes them and resumes voting — rather than merely mirroring frames.

Changes:
- Serve full proposals: implement GlobalService.GetGlobalProposal, assembling
  state + parent QC + prior TC + proposer vote from the clock store
  (FrameLookup::get_global_proposal). Previously stubbed to return nothing for
  non-genesis frames.
- Persist the proposer vote at proposal ingest (keyed filter,rank,selector) so
  it can be served back; the store trait had the accessor but no producer wrote
  it. Add ProposalVote/TimeoutCertificate proto<->wire conversions and
  proto_proposal_to_signed (only QC had one).
- Trigger: add Consumer::on_missing_parent, fired at the orphan-cache site,
  surfaced through GlobalConsumer and a required SyncTriggerHook on
  ConsensusActivationParams — required (not optional) so the recovery path
  cannot be silently left unwired.
- Catch-up task (node): a Notify-driven task pulls proposals via
  ArchiveClient::get_global_proposal ascending from the engine's finalized
  frame (tracked separately from the poller-shared head) and submits them
  (submit_quorum_certificate / submit_timeout_certificate / submit_proposal).

Verification: the devnet rank-1 partition scenario (archive-4 isolated) now
passes with the victim genuinely rejoining consensus — it orphans during the
partition, then on heal the catch-up supplies the one missing parent frame and
the engine finalizes frames 1..4 (state finalized, persisted candidate frames),
orphaning stops, and 4/4 nodes reach the stop frame with the chain-safety check
clean. Before this change the victim stayed at finalized=0, orphaning every
rank.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Daz <daz_the_corgi@proton.me>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dazthecorgi added a commit to dazthecorgi/monorepo that referenced this pull request Jul 1, 2026
Re-derives the devnet integration-test harness (crates/devnet orchestrator +
proxy) and its additive hooks onto the v2.1.0.25 base (709a20b), replacing the
prior long-lived devnet fork that re-implemented v2.1.0.24 independently. The
base already contains the equivalent of all the fork's pre-devnet commits
(v2.1.0.24/25 + memory + OOM work, the #56x/QuilibriumNetwork#570 PRs, and the leave/join tuning),
so only the genuinely-new "add devnet" delta is carried forward.

Additive hooks (re-applied onto the base's versions):
- quil-p2p behaviour.rs/node.rs: per-(source,target) gossip forward_filter +
  P2PCommand::SetForwardFilter / P2PHandle::set_forward_filter.
- quil-rpc frame_sync.rs: parameterize ArchiveEndpointPool blacklist TTL
  (Duration::ZERO disables banning for instant partition recovery).
- quil-config engine.rs/lib.rs: archiveBlacklistTtl engine field (-1 disables).
- quil-node master_node/mod.rs: wire the configured TTL into the pool.
- docker/Dockerfile.source: build-devnet-proxy / build-p2p-ping / devnet-proxy
  stages + p2p-ping in the node image.
- proxy consensus_events.rs test: ProposalVote gained the base's `openings` field.

Verified: full workspace builds; unit tests green (quil-config TTL, quil-rpc
pool_tests incl. zero-TTL, quil-p2p forward_filter_tests, devnet + devnet-proxy);
fmt (devnet crates only) + clippy --no-deps clean.

KNOWN FOLLOW-UP: the e2e docker partition test does not yet pass on this base.
v2.1.0.25 moved global consensus from blossomsub gossip to direct point-to-point
gRPC (direct_global_consensus_publisher / submit_global_consensus), so the proxy's
gossip-snooping no longer observes consensus frames and times out. The proxy must
be reworked to detect the stop frame (and impose global-consensus partitions) on
the gRPC submit_global_consensus path instead of gossip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants