Incident response

audience: operators

This page is a runbook. It lists the failure modes we have actually observed in testing and the minimal steps that resolve each. Each section is scoped to a single instance — if multiple instances on the same universe are misbehaving at once, something is wrong at the universe level (relays, DHT, network) rather than in any one instance, and you should start with the “Discovery is slow” section.

Stuck rounds

Symptom: no round finalized log on any committee server in this instance for more than 3 × ROUND_PERIOD + ROUND_DEADLINE.
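
For alerting, the stall threshold is plain arithmetic over the two round-timing knobs. A minimal sketch with assumed example values (substitute your instance’s actual ROUND_PERIOD and ROUND_DEADLINE):

```rust
use std::time::Duration;

/// The stall threshold from the symptom above: 3 × ROUND_PERIOD + ROUND_DEADLINE.
fn stall_threshold(round_period: Duration, round_deadline: Duration) -> Duration {
    round_period * 3 + round_deadline
}

fn main() {
    // Example values only: ROUND_PERIOD = 10 s, ROUND_DEADLINE = 5 s → alert at 35 s.
    let t = stall_threshold(Duration::from_secs(10), Duration::from_secs(5));
    assert_eq!(t, Duration::from_secs(35));
}
```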

Root-cause checklist, in order of likelihood:

  1. Fewer active clients than ZIPNET_MIN_PARTICIPANTS. The leader won’t open a round until this threshold is met.

    • Check: zipnet_client_registry_size{instance="…"} on any committee server (see the metrics sketch after this checklist).
    • Fix: either start more clients (or cover-traffic filler), or lower ZIPNET_MIN_PARTICIPANTS via a rolling restart of the committee. The value feeds the state machine’s signature derivation, so every member must run with the same setting.
  2. Committee has no leader. Raft election has not settled (yet, or ever).

    • Check: mosaik_groups_leader_is_local{instance="…"} == 0 on all members.
    • Fix: usually self-heals within ELECTION_TIMEOUT + BOOTSTRAP_DELAY. If persistent, suspect clock skew or a full network partition.
  3. Client bundles have not replicated to the committee. Clients have connected but their bundles haven’t landed in ClientRegistry — the aggregator hasn’t yet mirrored them in.

    • Check: aggregator log for registering client bundle; this should fire for each new client.
    • Fix: ensure the aggregator is reachable from every client (correct ZIPNET_BOOTSTRAP or working universe discovery). Wait one gossip cycle (≈ 15 s).
  4. One or more server bundles missing from ServerRegistry. A committee server failed to self-publish.

    • Check: query ServerRegistry size on each committee server; should equal committee size.
    • Fix: restart the offending server; it re-publishes on boot.
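
The Check steps in items 1 and 2 can be scripted against whatever Prometheus server scrapes the committee. A hedged sketch: the Prometheus URL is a placeholder, the metric names are the ones quoted above, and it requires the reqwest crate with its blocking feature enabled.

```rust
// Pull the registry-size and leadership metrics used in checks 1 and 2.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://prometheus.internal:9090"; // placeholder; use your server
    for metric in ["zipnet_client_registry_size", "mosaik_groups_leader_is_local"] {
        let url = format!("{base}/api/v1/query?query={metric}");
        let body = reqwest::blocking::get(url.as_str())?.text()?;
        println!("{metric}: {body}");
    }
    Ok(())
}
```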

If a publisher reports Error::ConnectTimeout that traces back to any of the root causes above, it is an operator-side issue surfacing as a user-side error. The SDK cannot distinguish “my instance name is wrong” from “the operator’s committee is stuck” — that’s a deliberate tradeoff of the no-registry design.

Split-brain

Symptom: two or more committee servers in this instance report mosaik_groups_leader_is_local == 1, or a server’s log shows rival group leader detected.

v1 uses mosaik’s modified Raft, which resolves rival leaders by term. The system self-heals within one ELECTION_TIMEOUT. If it does not self-heal:

  1. Check clock skew across committee members (ntpdate -q on each; see the sketch after this list). More than a few seconds of skew breaks Raft timing.
  2. Check the network — split-brain persisting past self-heal is a partition.
  3. As a last resort, SIGTERM the minority faction. They’ll rejoin as followers.
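
For step 1, a small wrapper to run on each committee member; pool.ntp.org is a placeholder reference server. The offsets reported across members should agree to within a second or two.

```rust
use std::process::Command;

// Query one reference NTP server and print the reported offset. Run this
// on every committee member; offsets differing by more than a few seconds
// are the skew that breaks Raft timing.
fn main() {
    let out = Command::new("ntpdate")
        .args(["-q", "pool.ntp.org"]) // placeholder; use your reference server
        .output()
        .expect("failed to run ntpdate; is it installed?");
    print!("{}", String::from_utf8_lossy(&out.stdout));
}
```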

Do not change ZIPNET_COMMITTEE_SECRET mid-incident. It would force a fresh committee group and hide evidence of the split, not resolve it.

Committee quorum loss

Symptom: fewer than a majority of committee servers are reachable. Rounds cannot commit.
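
The arithmetic behind “majority” is worth keeping at hand during an incident. This is standard Raft quorum math, not zipnet-specific code:

```rust
/// Raft commits need a strict majority: an n-member committee needs
/// n / 2 + 1 reachable servers.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

fn main() {
    // A 5-member committee needs 3 reachable servers and tolerates 2 losses;
    // a 4-member committee also needs 3, so it tolerates only 1.
    for n in [3, 4, 5] {
        println!("committee of {n}: quorum {}, tolerates {} failures", quorum(n), n - quorum(n));
    }
}
```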

  1. Restore the failed nodes. They rejoin on startup.
  2. If restoration is impossible (hardware loss, etc.), a v1 deployment has no graceful recovery — retire the instance and stand up a fresh one under a new name (or bump the deployment crate version). See Rotations and upgrades — Retiring and replacing an instance.

Aggregator crash-loop

Symptom: aggregator exits or OOMs shortly after boot.

Most common cause in v1: too many concurrent clients pushing envelopes, overrunning the aggregator’s internal stream buffer (buffer_size = 1024, the mosaik default).

Fix: either lower client concurrency by splitting the publisher fleet across multiple instances (each with its own ZIPNET_INSTANCE), or raise the aggregator’s stream buffer via network.streams().consumer::<ClientEnvelope>().with_buffer_size(N). The latter is a code change in zipnet-node (dev task); a sketch follows.
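
A rough sketch of that dev task, not a drop-in patch: only the consumer call chain comes from this page; the surrounding context, the variable name, and the chosen value are assumptions.

```rust
// Inside zipnet-node’s aggregator setup (context assumed; this fragment
// does not compile on its own). Raising the buffer trades memory for
// headroom under bursty client fan-in.
let envelopes = network
    .streams()
    .consumer::<ClientEnvelope>()
    .with_buffer_size(4096); // mosaik default is 1024; size to your fleet
```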

TDX attestation expiry

Symptom: committee rejects a previously-good peer with unauthorized; the peer re-bonds in a loop with the same outcome. On the peer side, logs mention an expired quote.

Causes, in order of likelihood:

  1. Quote exp elapsed. Each TDX quote carries an expiration. The bonded peer needs a fresh quote.
    • Fix: restart the peer. On restart the TDX layer fetches a new quote from the hardware. If the peer still fails, check the TDX host’s attestation service reachability.
  2. Clock skew between the peer and the committee. The committee rejects a quote whose exp has already passed by its local clock (see the sketch after this list).
    • Fix: NTP on both sides.
  3. MR_TD mismatch. The peer is running a different image than the committee expects. Common after a committee rebuild the peer hasn’t yet picked up.
    • Fix: update the peer to the image the committee expects and restart it; it re-bonds with the new measurement.
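
To make causes 1 and 2 concrete, a self-contained illustration; treating exp as unix seconds is an assumption about the quote format, not the real TDX layout.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// The committee-side test, in spirit: a quote whose expiry is behind the
/// verifier’s own clock is rejected, so an old quote and a skewed verifier
/// clock produce the same `unauthorized` outcome.
fn quote_expired(exp_unix_secs: u64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock set before 1970")
        .as_secs();
    now >= exp_unix_secs
}

fn main() {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    assert!(quote_expired(now - 30));   // expired 30 s ago: rejected
    assert!(!quote_expired(now + 300)); // 5 min of validity left: accepted
}
```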

Discovery is slow (universe-level)

Symptom: nodes log Could not bootstrap the routing table and take minutes to find each other. Typically affects all instances on the same universe simultaneously.

Usual cause: iroh’s pkarr / Mainline DHT bootstrap is struggling (common on fresh residential networks or a fresh universe). Workarounds:

  1. Pass an explicit ZIPNET_BOOTSTRAP=<peer_id> on every non-bootstrap node (see the sketch after this list).
  2. Rely on mDNS discovery (on by default in this prototype); for LAN deployments it is often enough.
  3. Run a mosaik bootstrap node (see mosaik’s examples/bootstrap.rs) with a well-known public address and seed it everywhere.
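
A minimal sketch of workaround 1 from the node’s point of view; only the ZIPNET_BOOTSTRAP variable comes from this page, the rest is assumed startup logic.

```rust
use std::env;

fn main() {
    // Prefer an explicit bootstrap peer when configured; otherwise fall
    // back to pkarr/Mainline DHT plus mDNS discovery.
    match env::var("ZIPNET_BOOTSTRAP") {
        Ok(peer_id) if !peer_id.is_empty() => {
            println!("dialing explicit bootstrap peer {peer_id}");
        }
        _ => println!("no ZIPNET_BOOTSTRAP set; waiting on DHT/mDNS discovery"),
    }
}
```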

A dedicated bootstrap node is recommended for any production universe that hosts more than one zipnet instance.

When to escalate

  • Unknown log messages containing committed or reverted outside the expected Raft lifecycle.
  • Broadcasts collection contains entries where the number of servers in the record does not match your configured committee size for this instance.
  • Any indication that two clients with the same ClientId coexist (would mean someone forged a bundle — investigate as a security incident; a triage sketch follows this list).
  • Publishers reporting WrongUniverse — indicates an operator misconfiguration of ZIPNET_UNIVERSE, or a publisher using the wrong zipnet::UNIVERSE constant.
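
For the duplicate-ClientId bullet, a hedged triage helper; ids are plain strings here for illustration, not the SDK’s real ClientId type.

```rust
use std::collections::HashSet;

/// Report every id that appears more than once in a scraped list.
fn duplicated_ids<'a>(ids: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for id in ids {
        if !seen.insert(id) && !dups.contains(&id) {
            dups.push(id);
        }
    }
    dups
}

fn main() {
    let ids = ["c1", "c2", "c1", "c3"];
    assert_eq!(duplicated_ids(ids), vec!["c1"]);
}
```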

See also