Rotations and upgrades

audience: operators

Every routine change to a running instance falls into one of these procedures. Follow them verbatim; the consensus and cryptographic layers are unforgiving of accidental divergence.

Rolling a committee server (restart, same identity)

Safe at any time. Raft handles a minority restart automatically.

  1. Stop the target server with SIGTERM. Wait for graceful exit (under 5 s).
  2. Replace the binary / restart the container / whatever triggered the rollout.
  3. Start the server with the same ZIPNET_INSTANCE, ZIPNET_SECRET, and ZIPNET_COMMITTEE_SECRET as before.
  4. Observe mosaik_groups_leader_is_local on the remaining servers — election should settle within a few seconds.
  5. Once the restarted server’s log shows round finalized, move to the next one.

Do not restart a majority of the committee simultaneously — that drops quorum and halts round progression until a majority is back up.
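The majority rule above reduces to simple arithmetic. The following is a minimal sketch, assuming the standard Raft strict-majority definition of quorum; the function names are illustrative and not part of any zipnet or mosaik API.

```python
def quorum(committee_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if committee_size < 1:
        raise ValueError("committee_size must be at least 1")
    return committee_size // 2 + 1


def max_concurrent_restarts(committee_size: int) -> int:
    """Largest number of servers that can be down while a majority stays up."""
    return committee_size - quorum(committee_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(
            f"committee of {n}: quorum {quorum(n)}, "
            f"at most {max_concurrent_restarts(n)} down at once"
        )
```

This is an upper bound, not a target: the procedure above still rolls one server at a time, and an even-sized committee tolerates no more failures than the next smaller odd size.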

Adding a committee server

  1. Provision the new node. Assign it a fresh ZIPNET_SECRET seed.
  2. Distribute the same ZIPNET_INSTANCE and ZIPNET_COMMITTEE_SECRET to it.
  3. Start it with --bootstrap <peer_id_of_any_existing_server>.
  4. Wait for the new server’s log to print round finalized — it has caught up.
  5. Update your operational runbook, monitoring targets, and audit log to reflect the added node.

The ServerRegistry collection automatically reflects the new member within one round. Clients start including the new server in their pad derivation from the next OpenRound the leader issues.

Removing a committee server

  1. Announce the removal at least one gossip cycle ahead (default 15 s) so catalog entries expire cleanly.
  2. SIGTERM the target node.
  3. Verify the remaining servers still form a majority and continue to finalize rounds (round finalized events in the logs).
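Step 3 can be checked mechanically once you have collected recent log lines from each remaining server. The sketch below assumes only that a finalized round appears in the log as a line containing the text "round finalized"; the log layout, the server names, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Set

# Substring named in the procedure above; the rest of the line format is assumed.
FINALIZED_MARKER = "round finalized"


def servers_finalizing(logs: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the servers whose collected log lines mention a finalized round."""
    return {
        name
        for name, lines in logs.items()
        if any(FINALIZED_MARKER in line for line in lines)
    }


def majority_is_finalizing(logs: Dict[str, Iterable[str]], remaining_size: int) -> bool:
    """True if a strict majority of the remaining servers report finalized rounds."""
    needed = remaining_size // 2 + 1
    return len(servers_finalizing(logs)) >= needed


if __name__ == "__main__":
    # Hypothetical log excerpts gathered after the SIGTERM in step 2.
    sample = {
        "server-a": ["INFO round finalized"],
        "server-b": ["INFO waiting for peers"],
        "server-c": ["INFO round finalized"],
    }
    print(sorted(servers_finalizing(sample)))
    print(majority_is_finalizing(sample, remaining_size=3))
```

Feed it only log lines written after the removal, otherwise events from before the SIGTERM will mask a stalled committee.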

Security warning

A removed server retains its DH secret. If that secret is not wiped, an adversary who later compromises the decommissioned machine can replay historic rounds and compute that server’s share of past pads. Combined with any other committee server’s DH secret compromise, this would break anonymity of past rounds. Wipe DH secrets on decommission.

Rotating a committee server’s long-term key

v1 does not have first-class key rotation. The procedure is “decommission + re-add”:

  1. Remove the old server (above).
  2. Add a new server with a fresh ZIPNET_SECRET (above).

The committee’s GroupId does not change (it depends on the instance name and shared ZIPNET_COMMITTEE_SECRET, not on individual node identities), so the Raft group persists across the swap. The ServerRegistry entry is updated automatically.
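To see why the swap leaves the GroupId untouched, consider a toy derivation that hashes only the instance name and the shared committee secret. This is an illustration, not mosaik's actual algorithm: the choice of SHA-256, the length-prefixed encoding, and the function name are all assumptions, and the real derivation may take further inputs.

```python
import hashlib


def toy_group_id(instance: str, committee_secret: bytes) -> str:
    """Hypothetical stand-in for GroupId derivation; NOT mosaik's real algorithm."""
    h = hashlib.sha256()
    # Length-prefix each field so ("ab", b"c") and ("a", b"bc") cannot collide.
    for field in (instance.encode("utf-8"), committee_secret):
        h.update(len(field).to_bytes(4, "big"))
        h.update(field)
    return h.hexdigest()


if __name__ == "__main__":
    before = toy_group_id("acme.mainnet", b"shared-committee-secret")
    # Replacing one node's ZIPNET_SECRET changes neither input, so the ID is stable.
    after_node_swap = toy_group_id("acme.mainnet", b"shared-committee-secret")
    # Changing the shared secret changes an input, so the ID changes.
    after_secret_rotation = toy_group_id("acme.mainnet", b"new-committee-secret")
    print(before == after_node_swap)
    print(before == after_secret_rotation)
```

Because no per-node secret is an input, replacing a node cannot change the result, whereas changing the shared secret does. That difference is why the next procedure is disruptive and this one is not.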

Rotating the committee secret

This is disruptive: changing ZIPNET_COMMITTEE_SECRET changes the GroupId, so the old committee is abandoned. External publishers compiled against the instance name still bond, but the committee they find is new.

  1. Announce a maintenance window.
  2. Stop every client, aggregator, and committee server on this instance.
  3. Distribute the new ZIPNET_COMMITTEE_SECRET to all committee members.
  4. Start the committee first, then the aggregator, then the clients.

Rotating round parameters

RoundParams (num_slots, slot_bytes, tag_len) is folded into the committee’s state-machine signature. Changing it is equivalent to rotating the committee secret (above), and it is a breaking change for any publisher that compiled the old parameters in. In practice, that means retiring and replacing the instance.

See Retiring and replacing an instance below.

Dev note

Developers changing RoundParams in code must also bump the signature string in CommitteeMachine::signature() when appropriate — otherwise old and new nodes silently derive the same GroupId but disagree on apply semantics. See The committee state machine.
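One way to make the failure mode in this note harder to hit is to derive the signature string from the parameters themselves, so the two cannot drift apart. The sketch below is a hypothetical design in Python for illustration only; it does not reproduce CommitteeMachine::signature(), and the base string and field encoding are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundParams:
    """Mirror of the three fields named above; the values used below are made up."""
    num_slots: int
    slot_bytes: int
    tag_len: int


def machine_signature(base: str, params: RoundParams) -> str:
    """Build a signature string that changes whenever any round parameter changes."""
    return (
        f"{base}"
        f"|num_slots={params.num_slots}"
        f"|slot_bytes={params.slot_bytes}"
        f"|tag_len={params.tag_len}"
    )


if __name__ == "__main__":
    old = RoundParams(num_slots=64, slot_bytes=256, tag_len=16)
    new = RoundParams(num_slots=64, slot_bytes=512, tag_len=16)
    print(machine_signature("committee/v1", old))
    print(machine_signature("committee/v1", new))
```

With a derived signature, nodes built with different parameters can no longer agree on the signature by accident. Changes to apply semantics that leave the parameters untouched still need a manual bump of the base string.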

Rebuilding a TDX image

Rebuilding the committee or client image produces a new MR_TD. The committee’s ticket validator is pinned to a specific MR_TD, so a rebuild requires coordinated rollout:

  1. Build the new image with cargo build --release (the mosaik TDX builder runs in build.rs, producing a fresh mrtd.hex).
  2. Publish the new mrtd.hex to your release-notes channel.
  3. Decide whether the change is ABI-compatible with the current committee’s expectations:
    • Patch-level image change (kernel patch, initramfs tweak, no wire-format or state-machine change): accept both MR_TDs transiently by updating the committee’s require_mrtd list to include the new hash, roll the committee hosts one at a time to the new image, then drop the old MR_TD from the allow-list.
    • Breaking change (new state-machine signature, new wire format, new RoundParams): treat it as retiring the instance (below).
  4. Sign and publish the new MR_TD, along with the retirement window for the old one, so publishers can rebuild their own images in time.
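The patch-level path in step 3 is a three-phase allow-list transition with one gating condition. The sketch below models it with plain Python values: the hashes are placeholders, the helper names are hypothetical, and how the list is actually applied to require_mrtd is outside its scope.

```python
from typing import Dict, List


def rollout_phases(old_mrtd: str, new_mrtd: str) -> List[List[str]]:
    """Allow-list contents for each phase of a patch-level image rollout."""
    return [
        [old_mrtd],            # before: only the current image is accepted
        [old_mrtd, new_mrtd],  # transition: both accepted while hosts roll
        [new_mrtd],            # after: the old image is no longer accepted
    ]


def can_drop_old(host_mrtds: Dict[str, str], new_mrtd: str) -> bool:
    """True only once every committee host reports the new MR_TD."""
    return bool(host_mrtds) and all(m == new_mrtd for m in host_mrtds.values())


if __name__ == "__main__":
    OLD, NEW = "aa" * 48, "bb" * 48  # placeholder hex strings, not real measurements
    for phase in rollout_phases(OLD, NEW):
        print([m[:8] for m in phase])
    hosts = {"server-a": NEW, "server-b": NEW, "server-c": OLD}
    print(can_drop_old(hosts, NEW))  # server-c has not rolled yet
```

Do not move to the final phase until can_drop_old holds, because a host still running the old image would stop being accepted as soon as its MR_TD leaves the list.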

Retiring and replacing an instance

Use this path whenever a cross-compatibility boundary moves (RoundParams, CommitteeMachine::signature, wire format, breaking MR_TD change). You have two idiomatic versioning stories:

  • Version in the name. Stand up the new deployment under a new instance name (acme.mainnet.v2). Old and new run in parallel for the transition window; publishers re-pin and rebuild at their own pace; you tear down the old instance when traffic has drained. This is the cleanest story for external publishers, but it forces them to cut a release.
  • Lockstep release against a shared deployment crate. Keep the instance name stable, cut a new deployment-crate version pinning the new state-machine signature, and coordinate operator + publisher upgrades as a single release event. Avoids instance-ID churn at the cost of tighter release-cadence coupling.

Zipnet v1 does not mandate which you pick; see Designing coexisting systems on mosaik — Versioning under stable instance names for the full tradeoff.

Retirement itself is just stopping every server under the old instance name. Publishers still trying to bond see ConnectTimeout; they rebuild against the new name or the new deployment crate and reconnect.

Upgrading the binary

Patch-level upgrades (no CommitteeMachine::signature change, no RoundParams change, no wire format change, no MR_TD change if TDX-gated) are safe to roll one node at a time following the restart procedure.

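Upgrades that change any of those four cross a compatibility boundary; treat them like retiring the instance. The one exception described on this page is a patch-level MR_TD change, which can follow the transitional allow-list procedure under Rebuilding a TDX image.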
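The decision above amounts to comparing four values between the running release and the candidate. Here is a minimal sketch of that comparison; the CompatSurface record and its field names are hypothetical, and it assumes you can read the four values out of each release by hand.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CompatSurface:
    """The four values whose change crosses a compatibility boundary."""
    machine_signature: str
    round_params: Tuple[int, int, int]  # (num_slots, slot_bytes, tag_len)
    wire_version: int
    mrtd: Optional[str] = None  # None when the deployment is not TDX-gated


def changed_boundaries(old: CompatSurface, new: CompatSurface) -> List[str]:
    """Names of the fields that differ between two releases, in declaration order."""
    return [
        f.name
        for f in fields(CompatSurface)
        if getattr(old, f.name) != getattr(new, f.name)
    ]


def can_roll_one_at_a_time(old: CompatSurface, new: CompatSurface) -> bool:
    """True when nothing on the compatibility surface changed."""
    return not changed_boundaries(old, new)


if __name__ == "__main__":
    running = CompatSurface("committee/v1", (64, 256, 16), 1)
    candidate = CompatSurface("committee/v1", (64, 256, 16), 2)
    print(changed_boundaries(running, candidate))
    print(can_roll_one_at_a_time(running, candidate))
```

An empty result means the restart procedure applies. A result containing only mrtd may still qualify for the allow-list transition described under Rebuilding a TDX image, if the image change is patch-level.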

Dev notes on where to look in source:

  • WIRE_VERSION in crates/zipnet-proto/src/lib.rs
  • CommitteeMachine::signature in crates/zipnet-node/src/committee.rs
  • RoundParams::default_v1 in crates/zipnet-proto/src/params.rs

Any change to those requires a coordinated restart of the whole instance.

See also