# Incident response
audience: operators
This page is a runbook. It lists the failure modes we have actually observed in testing and the minimal steps that resolve each. Each section is scoped to a single instance — if multiple instances on the same universe are misbehaving at once, something is wrong at the universe level (relays, DHT, network) rather than in any one instance, and you should start with the “Discovery is slow” section.
## Stuck rounds
Symptom: no `round finalized` log on any committee server in this instance for more than 3 × `ROUND_PERIOD` + `ROUND_DEADLINE`.
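As a quick sanity check, the detection window can be computed directly. A minimal sketch; the period and deadline values below are placeholders, not zipnet defaults:

```shell
# Hypothetical values; read the real ones from your deployment config.
ROUND_PERIOD=10      # seconds between round openings (placeholder)
ROUND_DEADLINE=5     # seconds the leader waits for submissions (placeholder)

# A round counts as stuck once no `round finalized` log has appeared
# for 3 * ROUND_PERIOD + ROUND_DEADLINE seconds.
stuck_after() {
  echo $(( 3 * ROUND_PERIOD + ROUND_DEADLINE ))
}

stuck_after   # prints 35 with the placeholder values above
```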
Root-cause checklist, in order of likelihood:
- Fewer active clients than `ZIPNET_MIN_PARTICIPANTS`. The leader won’t open a round until this threshold is met.
  - Check: `zipnet_client_registry_size{instance="…"}` on any committee server.
  - Fix: either start more clients (or cover-traffic filler) or lower `ZIPNET_MIN_PARTICIPANTS` (rolling restart of the committee; the value feeds the state machine’s signature derivation, so every member needs the same setting).
- Committee has no leader. The Raft election has not settled (yet, or ever).
  - Check: `mosaik_groups_leader_is_local{instance="…"} == 0` on all members.
  - Fix: usually self-heals within `ELECTION_TIMEOUT + BOOTSTRAP_DELAY`. If persistent, suspect clock skew or a full network partition.
- Client bundles have not replicated to the committee. Clients have connected, but their bundles haven’t landed in `ClientRegistry` because the aggregator hasn’t yet mirrored them in.
  - Check: aggregator log for `registering client bundle`; this should fire once for each new client.
  - Fix: ensure the aggregator is reachable from every client (correct `ZIPNET_BOOTSTRAP` or working universe discovery). Wait one gossip cycle (≈ 15 s).
- One or more server bundles missing from `ServerRegistry`. A committee server failed to self-publish.
  - Check: query the `ServerRegistry` size on each committee server; it should equal the committee size.
  - Fix: restart the offending server; it re-publishes on boot.
If a publisher reports `Error::ConnectTimeout` that traces back to any of the root causes above, it is an operator-side issue surfacing as a user-side error. The SDK cannot distinguish “my instance name is wrong” from “the operator’s committee is stuck”; that is a deliberate tradeoff of the no-registry design.
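The registry-size check in the first checklist item can be scripted against a Prometheus-style metrics scrape. A minimal sketch, assuming the committee servers expose text-format metrics; the sample scrape line, label value, and threshold are stand-ins:

```shell
# Sketch: extract zipnet_client_registry_size from a Prometheus
# text-format scrape and compare it to ZIPNET_MIN_PARTICIPANTS.
# The scrape below is a stand-in; in practice pipe in the output of
# a metrics request against the committee server.
ZIPNET_MIN_PARTICIPANTS=5

scrape='zipnet_client_registry_size{instance="demo"} 3'

registry_size=$(echo "$scrape" \
  | awk '/^zipnet_client_registry_size/ { print $2 }')

if [ "$registry_size" -lt "$ZIPNET_MIN_PARTICIPANTS" ]; then
  echo "below threshold: $registry_size < $ZIPNET_MIN_PARTICIPANTS"
else
  echo "ok: $registry_size clients registered"
fi
```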
## Split-brain
Symptom: two or more committee servers in this instance report `mosaik_groups_leader_is_local == 1`, or a server’s log shows `rival group leader detected`.
v1 uses mosaik’s modified Raft, which resolves rivals by term. The system self-heals within one `ELECTION_TIMEOUT`. If it does not self-heal:
- Check clock skew across committee members (`ntpdate -q` on each). More than a few seconds of skew breaks Raft timing.
- Check the network: a split-brain that persists past the self-heal window is a partition.
- As a last resort, `SIGTERM` the minority faction. They’ll rejoin as followers.
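The leadership triage above can be collapsed into one helper that counts how many members claim to lead. A sketch, assuming you have collected one `mosaik_groups_leader_is_local` sample per committee member; the values passed in below are stand-ins:

```shell
# Sketch: classify committee leadership from per-member samples of
# mosaik_groups_leader_is_local (1 = that node believes it leads).
classify_leadership() {
  leaders=0
  for v in "$@"; do
    [ "$v" -eq 1 ] && leaders=$(( leaders + 1 ))
  done
  case "$leaders" in
    0) echo "no-leader" ;;     # election not settled (see Stuck rounds)
    1) echo "healthy" ;;
    *) echo "split-brain" ;;   # two or more rivals
  esac
}

classify_leadership 0 1 1   # two rivals -> prints "split-brain"
```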
Do not change `ZIPNET_COMMITTEE_SECRET` mid-incident. It would force a fresh committee group and hide evidence of the split, not resolve it.
## Committee quorum loss
Symptom: fewer than a majority of committee servers are reachable. Rounds cannot commit.
- Restore the failed nodes. They rejoin on startup.
- If restoration is impossible (hardware loss, etc.), a v1 deployment has no graceful recovery — retire the instance and stand up a fresh one under a new name (or bump the deployment crate version). See Rotations and upgrades — Retiring and replacing an instance.
## Aggregator crash-loop
Symptom: aggregator exits or OOMs shortly after boot.
Most common cause in v1: too many concurrent clients pushing envelopes larger than the internal buffer (`buffer_size = 1024`, the mosaik default).
Fix: either lower client concurrency by splitting the publisher fleet across multiple instances (each with its own `ZIPNET_INSTANCE`), or tune the aggregator’s stream buffer when calling `network.streams().consumer::<ClientEnvelope>().with_buffer_size(N)` (this requires a code change in `zipnet-node`; a dev task).
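To see why a large fleet can OOM the aggregator, a rough sizing sketch; the envelope size below is a hypothetical figure for illustration, not a zipnet constant:

```shell
# Rough sizing sketch (the envelope size is an assumption, not a spec):
# per-consumer buffered memory ~= buffer_size * max envelope bytes.
BUFFER_SIZE=1024                  # mosaik default per this runbook
ENVELOPE_BYTES=$(( 64 * 1024 ))   # hypothetical 64 KiB envelope

buffered_mib() {
  echo $(( BUFFER_SIZE * ENVELOPE_BYTES / 1024 / 1024 ))
}

buffered_mib   # prints 64 with these assumptions
```

A full buffer of 64 KiB envelopes already holds tens of MiB per consumer, so concurrency, envelope size, or `N` must be traded off against available memory.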
## TDX attestation expiry
Symptom: the committee rejects a previously-good peer with `unauthorized`; the peer re-bonds in a loop with the same outcome. On the peer side, logs mention an expired quote.
Causes, in order of likelihood:
- Quote `exp` elapsed. Each TDX quote carries an expiration. The bonded peer needs a fresh quote.
  - Fix: restart the peer. On restart the TDX layer fetches a new quote from the hardware. If the peer still fails, check the TDX host’s attestation service reachability.
- Clock skew between the peer and the committee. The committee rejects a quote whose `exp` has already passed on its local clock.
  - Fix: NTP on both sides.
- `MR_TD` mismatch. The peer is running a different image than the committee expects. Common after a committee rebuild the peer hasn’t yet picked up.
  - Fix: rebuild the peer image from the current release, or see Rotations and upgrades — Rebuilding a TDX image for the transition plan.
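The first two causes can be distinguished with a plain timestamp comparison. A sketch, assuming the quote’s `exp` is a Unix timestamp in seconds; the skew margin is an assumption, not a protocol constant:

```shell
# Sketch: decide whether a quote would be rejected as expired, given
# the committee's local time and a tolerated clock-skew margin.
SKEW_MARGIN=30   # seconds of tolerated skew (hypothetical value)

quote_status() {
  exp=$1; now=$2
  if [ "$now" -gt "$exp" ]; then
    echo "expired"         # restart the peer for a fresh quote
  elif [ $(( exp - now )) -lt "$SKEW_MARGIN" ]; then
    echo "near-expiry"     # clock skew could flip this to expired
  else
    echo "valid"
  fi
}

quote_status 1700000000 1700000100   # now is past exp -> "expired"
```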
## Discovery is slow (universe-level)
Symptom: nodes log `Could not bootstrap the routing table` and take minutes to find each other. This typically affects all instances on the same universe simultaneously.
Usual cause: iroh’s pkarr / Mainline DHT bootstrap is struggling (common on fresh residential networks or a fresh universe). Workarounds:
- Pass an explicit `ZIPNET_BOOTSTRAP=<peer_id>` on every non-bootstrap node.
- Enable mDNS discovery (already on by default in this prototype). For LAN deployments this is often enough.
- Run a mosaik bootstrap node (see mosaik’s `examples/bootstrap.rs`) with a well-known public address and seed it everywhere.
A dedicated bootstrap node is recommended for any production universe that hosts more than one zipnet instance.
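Seeding `ZIPNET_BOOTSTRAP` on every non-bootstrap node can be done with a shared env file. A sketch; the peer id and file name are placeholders:

```shell
# Sketch: one env file shared by all non-bootstrap nodes.
# BOOTSTRAP_PEER is a placeholder; use your bootstrap node's real id.
BOOTSTRAP_PEER="<peer_id-of-your-bootstrap-node>"

cat > zipnet-node.env <<EOF
ZIPNET_BOOTSTRAP=${BOOTSTRAP_PEER}
EOF

cat zipnet-node.env
```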
## When to escalate
- Unknown log messages containing `committed` or `reverted` outside the expected Raft lifecycle.
- The `Broadcasts` collection contains entries where the number of servers in the record does not match your configured committee size for this instance.
- Any indication that two clients with the same `ClientId` coexist (this would mean someone forged a bundle; investigate as a security incident).
- Publishers reporting `WrongUniverse`: indicates an operator misconfiguration of `ZIPNET_UNIVERSE`, or a publisher using the wrong `zipnet::UNIVERSE` constant.
## See also
- Monitoring and alerts — the alerts that surface these conditions.
- Rotations and upgrades — controlled changes that avoid these incidents in the first place.