
Monitoring and alerts

audience: operators

Zipnet inherits mosaik’s Prometheus exporter. Enable it by setting ZIPNET_METRICS=0.0.0.0:9100 (or a port of your choice) on every node you want scraped. See Metrics reference for the complete list; this page covers the metrics that actually tell you whether an instance is healthy.

All zipnet-emitted metrics carry an instance="<name>" label set from ZIPNET_INSTANCE. Scope your alert rules on that label so a stuck preview.alpha doesn’t page the on-call for acme.mainnet.
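For example, a PromQL expression that fires only for one instance might look like the following sketch. The 10m window is illustrative; zipnet_round_finalized_total is the round counter this page uses for dashboards:

```promql
# Fires for acme.mainnet only; preview.alpha gets its own copy of the rule.
rate(zipnet_round_finalized_total{instance="acme.mainnet"}[10m]) == 0
```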

The three questions you ask every shift

1. “Are rounds finalizing?”

The authoritative signal is new entries appearing in the Broadcasts collection. Track the rate of round finalized log events on committee servers (INFO level). A healthy instance finalizes one round per ZIPNET_ROUND_PERIOD interval, plus or minus ZIPNET_FOLD_DEADLINE.

Alert condition: no round finalized event on a leader server for 3 × ZIPNET_ROUND_PERIOD + ZIPNET_FOLD_DEADLINE.

2. “Is the committee healthy?”

  • Exactly one committee server in this instance should report itself as leader at any one time. If zero or two-plus, investigate (see Incident response — split-brain). The relevant metric is mosaik_groups_leader_is_local{instance="…"}.
  • Bond count per server should equal N − 1 where N is the committee size. A dropped bond suggests a universe-level partition or an expired ticket.
  • Raft log position should advance in lockstep across servers. A persistent lag (> 5 indices) on one server indicates that node is falling behind.
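The first check above fits in a single PromQL expression; a sketch, assuming mosaik_groups_leader_is_local is a 0/1 gauge on each server:

```promql
# Healthy: exactly 1 per instance. 0 = no leader, >= 2 = split-brain.
sum by (instance) (mosaik_groups_leader_is_local)
```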

3. “Are clients and their pubkeys reaching the committee?”

  • ClientRegistry size ≈ number of clients you launched for this instance, give or take gossip cycles.
  • Per-round participants count in round finalized events ≈ the number of non-idle clients.

Alert condition: participants = 0 for two consecutive rounds while you expected > 0.

Useful log filters

On committee servers:

journalctl -u zipnet-server@acme-mainnet -f \
  --grep='round finalized|opening round|submitted partial|SubmitAggregate|rival group leader'

On the aggregator:

journalctl -u zipnet-aggregator@acme-mainnet -f \
  --grep='forwarded aggregate|registering client'

On clients:

journalctl -u zipnet-client@acme-mainnet -f \
  --grep='sealed envelope|registration'

(Adjust for your process supervisor.)

Baseline expectations at default parameters

| Condition | Committee server | Aggregator | Client |
|---|---|---|---|
| Steady-state CPU | < 5 % on a mid-range core | varies with client count | < 1 % |
| RAM | 50–200 MB | 100–500 MB | 20–50 MB |
| Bond count | committee_size − 1 | 0 (not a group member) | 0 |
| Gossip catalog size | total universe node count ± 2 | total universe node count ± 2 | total universe node count ± 2 |
| Inbound per round | N × B / committee_size (replication) | N × B | B |
| Outbound per round | B + heartbeats | committee_size × B | B |

N = clients, B = broadcast vector bytes (default 16 KiB).
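Plugging in concrete numbers makes these formulas easier to sanity-check against a bandwidth graph. A minimal sketch, assuming an illustrative 100 clients and a 3-server committee (only B = 16 KiB is the documented default):

```python
# Per-round traffic estimates from the baseline table.
# Assumptions (illustrative): N = 100 clients, committee_size = 3.
# B = 16 KiB is the documented default broadcast vector size.
N = 100
committee_size = 3
B = 16 * 1024  # bytes

aggregator_inbound = N * B                 # one vector from each client
server_inbound = N * B // committee_size   # replicated share per committee server
aggregator_outbound = committee_size * B   # aggregate fanned out to the committee

print(aggregator_inbound)   # 1638400 bytes, ~1.6 MB per round
print(server_inbound)       # 546133 bytes, ~0.5 MB per round
print(aggregator_outbound)  # 49152 bytes, 48 KiB per round
```

Divide by ZIPNET_ROUND_PERIOD to turn these per-round figures into sustained rates.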

Dev note

The gossip catalog includes peers from every service on the shared universe, not just zipnet. Your catalog size may be much larger than your committee size if the universe also hosts multisig signers, oracles, or other mosaik agents. Do not alert on absolute catalog size; alert on change in catalog size relative to a baseline.

Sensible alerts to configure

  1. Round stall. No new Broadcasts entry for 3 × ZIPNET_ROUND_PERIOD + ZIPNET_FOLD_DEADLINE. Page on-call: the committee is stuck, the aggregator is down, or min_participants is unmet.
  2. Committee partition. sum by (instance) (mosaik_groups_leader_is_local{instance="…"}) is 0 or ≥ 2 for more than 1 minute. Page on-call.
  3. TDX attestation approaching expiry. Less than 24 h to ticket exp on any bonded peer. Page TEE operator.
  4. Bond drop. mosaik_groups_bonds{peer="<known>",instance="…"} drops from 1 to 0 and stays there for more than 30 s.
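As a starting point, alerts 1 and 2 could be written as Prometheus rules roughly like this. The metric names come from this page; the instance name, windows, and severity label are placeholders, and the 15m window stands in for 3 × ROUND_PERIOD plus the deadline at your parameters:

```yaml
groups:
  - name: zipnet-acme-mainnet
    rules:
      - alert: ZipnetRoundStall
        # Alert 1: no round finalized within the stall window.
        expr: increase(zipnet_round_finalized_total{instance="acme.mainnet"}[15m]) == 0
        for: 1m
        labels:
          severity: page
      - alert: ZipnetCommitteePartition
        # Alert 2: leader count is not exactly one.
        expr: sum by (instance) (mosaik_groups_leader_is_local{instance="acme.mainnet"}) != 1
        for: 1m
        labels:
          severity: page
```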

Multi-instance dashboards

Since multiple instances share the same universe and the same host fleet, build dashboards with instance as a dimension from the start:

  • A top-level panel showing rate(zipnet_round_finalized_total[1m]) broken out by instance.
  • A committee-health grid: rows are instances, columns are the committee members, cells are mosaik_groups_leader_is_local.
  • A per-instance heatmap of participants over time — sparse rounds are often the first hint of a sick publisher fleet.

A starter Grafana dashboard is not shipped in v1. The metrics list in Metrics reference is sufficient to build one. A community-maintained dashboard is tracked as a v2 follow-up.

See also