# Monitoring and alerts

*Audience: operators*
Zipnet inherits mosaik’s Prometheus exporter. Enable it by setting
`ZIPNET_METRICS=0.0.0.0:9100` (or a port of your choice) on every
node you want scraped. See Metrics reference for the complete list;
this page covers the metrics that actually tell you whether an
instance is healthy.
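As a starting point, a minimal Prometheus scrape job for a fleet exporting on port 9100 might look like the sketch below. The hostnames are placeholders for your own inventory; the `honor_labels` setting matters because zipnet emits its own `instance` label, which would otherwise collide with the `instance` label Prometheus assigns from the target address.

```yaml
scrape_configs:
  - job_name: zipnet
    # Zipnet metrics carry their own instance="<name>" label, which collides
    # with Prometheus's automatic instance label (the target address).
    # honor_labels keeps the zipnet-emitted label instead of renaming it
    # to exported_instance.
    honor_labels: true
    static_configs:
      - targets:  # placeholders — substitute your own hosts
          - committee-1.example.internal:9100
          - committee-2.example.internal:9100
          - aggregator-1.example.internal:9100
```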
All zipnet-emitted metrics carry an `instance="<name>"` label set
from `ZIPNET_INSTANCE`. Scope your alert rules on that label so a
stuck `preview.alpha` doesn’t page the on-call for `acme.mainnet`.
## The three questions you ask every shift
### 1. “Are rounds finalizing?”

The authoritative signal is new entries appearing in the
`Broadcasts` collection. Track the rate of `round finalized` log
events on committee servers (INFO level). A healthy instance
finalizes one round per `ZIPNET_ROUND_PERIOD` interval, plus or
minus `ZIPNET_FOLD_DEADLINE`.

Alert condition: no `round finalized` event on a leader server for
`3 × ROUND_PERIOD + ROUND_DEADLINE`.
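A minimal Prometheus rule for this condition, assuming the `zipnet_round_finalized_total` counter and a 30 s round period (so the stall window lands near two minutes — substitute the values from your own configuration):

```yaml
groups:
  - name: zipnet-round-liveness
    rules:
      - alert: ZipnetRoundStall
        # No round finalized in roughly 3 × ROUND_PERIOD + ROUND_DEADLINE.
        # The 2m window assumes a 30 s round period; adjust to your config.
        # Note: increase() returns nothing (not 0) if the series is absent,
        # so pair this with an absent() rule to catch a dead exporter.
        expr: increase(zipnet_round_finalized_total{instance="acme.mainnet"}[2m]) == 0
        labels:
          severity: page
        annotations:
          summary: "No round finalized on {{ $labels.instance }} for 2m"
```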
### 2. “Is the committee healthy?”

- Exactly one committee server in this instance should report itself
  as leader at any one time. If zero or two or more, investigate (see
  Incident response — split-brain). The relevant metric is
  `mosaik_groups_leader_is_local{instance="…"}`.
- Bond count per server should equal `N − 1`, where N is the
  committee size. A dropped bond suggests a universe-level partition
  or an expired ticket.
- Raft log position should advance in lockstep across servers. A
  persistent lag (> 5 indices) on one server indicates that node is
  falling behind.
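The leader check translates directly into a rule. A sketch, using the `mosaik_groups_leader_is_local` metric named above (the alert name and severity label are illustrative):

```yaml
groups:
  - name: zipnet-committee-health
    rules:
      - alert: ZipnetCommitteeSplitBrain
        # Exactly one leader per instance at any time:
        # 0 means the committee is stuck, ≥ 2 means split-brain.
        expr: sum by (instance) (mosaik_groups_leader_is_local) != 1
        for: 1m
        labels:
          severity: page
```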
### 3. “Are clients and their pubkeys reaching the committee?”

- `ClientRegistry` size ≈ number of clients you launched for this
  instance, give or take gossip cycles.
- Per-round `participants` count in `round finalized` events ≈ the
  number of non-idle clients.

Alert condition: `participants = 0` for two consecutive rounds while
you expected > 0.
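The participant count lives in log events, not in the v1 metrics list, so alerting on it requires exporting it yourself. If you derive a per-round gauge from the `round finalized` logs (the name `zipnet_round_participants` below is hypothetical, and a log-to-metric tool such as mtail is one way to produce it), the condition maps to:

```yaml
groups:
  - name: zipnet-participation
    rules:
      - alert: ZipnetZeroParticipants
        # zipnet_round_participants is a HYPOTHETICAL gauge derived from
        # "round finalized" log events; zipnet does not export it natively.
        # The 2m window approximates two consecutive rounds; adjust to
        # your ROUND_PERIOD.
        expr: max_over_time(zipnet_round_participants{instance="acme.mainnet"}[2m]) == 0
        labels:
          severity: ticket
```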
## Useful log filters
On committee servers:

```shell
journalctl -u zipnet-server@acme-mainnet -f \
  --grep='round finalized|opening round|submitted partial|SubmitAggregate|rival group leader'
```

On the aggregator:

```shell
journalctl -u zipnet-aggregator@acme-mainnet -f \
  --grep='forwarded aggregate|registering client'
```

On clients:

```shell
journalctl -u zipnet-client@acme-mainnet -f \
  --grep='sealed envelope|registration'
```

(Adjust for your process supervisor.)
## Baseline expectations at default parameters
| Quantity | Committee server | Aggregator | Client |
|---|---|---|---|
| Steady-state CPU | < 5 % on a mid-range core | varies with client count | < 1 % |
| RAM | 50–200 MB | 100–500 MB | 20–50 MB |
| Bond count | committee_size − 1 | 0 (not a group member) | 0 |
| Gossip catalog size | total universe node count ± 2 | total universe node count ± 2 | total universe node count ± 2 |
| Inbound per round | N × B / committee_size (replication) | N × B | B / client |
| Outbound per round | B + heartbeats | committee_size × B | B |
`N` = clients, `B` = broadcast vector bytes (default 16 KiB).
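A worked example under assumed numbers (N = 100 clients, B = 16 KiB, committee size 5 — illustrative, not recommendations):

```
inbound per committee server = N × B / committee_size = 100 × 16 KiB / 5 = 320 KiB/round
inbound at the aggregator    = N × B                  = 100 × 16 KiB     = 1600 KiB/round
outbound at the aggregator   = committee_size × B     = 5 × 16 KiB       = 80 KiB/round
```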
## Dev note
The gossip catalog includes peers from every service on the shared universe, not just zipnet. Your catalog size may be much larger than your committee size if the universe also hosts multisig signers, oracles, or other mosaik agents. Do not alert on absolute catalog size; alert on change in catalog size relative to a baseline.
## Sensible alerts to configure

- Round stall. No new `Broadcasts` entry for
  `3 × ROUND_PERIOD + ROUND_DEADLINE`. Page on-call: committee is
  stuck, aggregator is down, or `min_participants` is unmet.
- Committee partition.
  `sum by (instance) (mosaik_groups_leader_is_local{instance="…"})`
  is 0 or ≥ 2 for more than 1 minute. Page on-call.
- TDX attestation approaching expiry. Less than 24 h to ticket `exp`
  on any bonded peer. Page the TEE operator.
- Bond drop. `mosaik_groups_bonds{peer=<known>,instance="…"}` drops
  from 1 to 0 for more than 30 s and does not recover.
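The last two alerts might be sketched as follows. The bond-drop expression uses the `mosaik_groups_bonds` metric named above; the attestation rule assumes you expose each bonded peer's ticket `exp` as a unix-timestamp gauge (the name `mosaik_tee_ticket_exp_seconds` is hypothetical — check the Metrics reference for the real one):

```yaml
groups:
  - name: zipnet-bonds-and-attestation
    rules:
      - alert: ZipnetBondDropped
        # Fires when a known peer's bond stays at 0 for 30 s.
        expr: mosaik_groups_bonds{instance="acme.mainnet"} == 0
        for: 30s
        labels:
          severity: page
      - alert: ZipnetTicketExpiringSoon
        # HYPOTHETICAL metric name: the bonded peer's ticket exp as a
        # unix-timestamp gauge. 86400 s = 24 h.
        expr: (mosaik_tee_ticket_exp_seconds - time()) < 86400
        labels:
          severity: tee-operator
```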
## Multi-instance dashboards

Since multiple instances share the same universe and the same host
fleet, build dashboards with `instance` as a dimension from the
start:

- A top-level panel showing `rate(zipnet_round_finalized_total[1m])`
  broken out by `instance`.
- A committee-health grid: rows are instances, columns are the
  committee members, cells are `mosaik_groups_leader_is_local`.
- A per-instance heatmap of `participants` over time — sparse rounds
  are often the first hint of a sick publisher fleet.
A starter Grafana dashboard is not shipped in v1. The metrics list in Metrics reference is sufficient to build one. A community-maintained dashboard is tracked as a v2 follow-up.
## See also
- Incident response — what to check when an alert fires.
- Metrics reference — the full label and metric list.