# Metrics reference

_Audience: operators_
Every zipnet binary exposes a Prometheus endpoint when
`ZIPNET_METRICS` is set. The tables below list the metrics worth
scraping in production. Metrics starting with `mosaik_` are emitted
by the underlying mosaik library and documented in full in the
mosaik book — Metrics;
only the ones that are load-bearing for zipnet operations are listed here.
Instance-scoped metrics carry an `instance` label whose value
is the operator’s `ZIPNET_INSTANCE` string (e.g. `acme.mainnet`).
When a host multiplexes several instances (see
Operator quickstart — running many instances),
every instance-scoped metric is emitted once per instance.
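As a sketch, a scrape job for these endpoints might look like the following. The port, metrics path, and hostname are illustrative assumptions, not zipnet defaults — the actual listen address comes from however `ZIPNET_METRICS` is configured in your deployment:

```yaml
# Prometheus scrape_configs fragment.
# Port 9464 and the /metrics path are assumptions for illustration.
scrape_configs:
  - job_name: zipnet
    metrics_path: /metrics
    static_configs:
      - targets:
          - "committee-0.internal:9464"   # hypothetical committee host
          - "aggregator-0.internal:9464"  # hypothetical aggregator host
```

Because the `instance` label here is emitted by zipnet itself, you do not need relabeling to distinguish multiplexed instances on one host.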
## Per-role metrics

### Committee server
| Metric | Kind | Meaning | Healthy value |
|---|---|---|---|
| `mosaik_groups_leader_is_local{instance=<name>}` | gauge (0/1) | Whether this node is the Raft leader for the instance | Exactly one 1 across the committee of each instance |
| `mosaik_groups_bonds{peer=<id>,instance=<name>}` | gauge (0/1) | Whether a bond to a specific peer is healthy | 1 for every other committee member of the same instance |
| `mosaik_groups_committed_index{instance=<name>}` | gauge | Highest committed Raft index | Monotonically increasing, step ≈ 2 per round |
| `zipnet_rounds_finalized_total{instance=<name>}` | counter | Rounds this node saw finalize | Increases at ~1 / ZIPNET_ROUND_PERIOD |
| `zipnet_partials_submitted_total{instance=<name>}` | counter | Partials this node contributed | Increases by 1 per round |
| `zipnet_client_registry_size{instance=<name>}` | gauge | Clients currently registered | Roughly equal to the expected client count |
| `zipnet_server_registry_size{instance=<name>}` | gauge | Servers currently registered | Equals committee size |
The `mosaik_groups_leader_is_local` gauge is the one the operator
quickstart tells you to check when bringing a new instance up —
exactly one committee node should report 1 per instance.
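The exactly-one-leader invariant can be checked directly in PromQL; as a sketch, this expression returns a series only for instances that are currently violating it:

```
# Empty result = healthy.
# Any series returned names an instance with zero or multiple leaders.
sum by (instance) (mosaik_groups_leader_is_local) != 1
```

Run it ad hoc during bring-up, or use it as the expression of a trouble alert.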
### Aggregator
| Metric | Kind | Meaning | Healthy value |
|---|---|---|---|
| `mosaik_streams_consumer_subscribed_producers{stream=<id>,instance=<name>}` | gauge | Number of producers this consumer is attached to | = client count for `ClientToAggregator` |
| `mosaik_streams_producer_subscribed_consumers{stream=<id>,instance=<name>}` | gauge | Number of consumers attached to this producer | = committee size for `AggregateToServers` |
| `zipnet_aggregates_forwarded_total{instance=<name>}` | counter | Aggregates sent to the committee | ≈ rounds finalized |
| `zipnet_fold_participants{round=<r>,instance=<name>}` | histogram | Clients per folded round | Depends on your client count |
| `zipnet_clients_registered_total{instance=<name>}` | counter | Client bundles mirrored into `ClientRegistry` | Grows to the client count, then plateaus |
### Client
| Metric | Kind | Meaning | Healthy value |
|---|---|---|---|
| `zipnet_envelopes_sent_total{instance=<name>}` | counter | Envelopes sealed and pushed | Increases by 1 per talk round |
| `zipnet_envelope_send_errors_total{instance=<name>}` | counter | Envelope send failures | Ideally 0 |
| `zipnet_client_registered{instance=<name>}` | gauge (0/1) | Whether our bundle is in `ClientRegistry` | 1 after the first few seconds |
## Metrics that indicate trouble
| Metric | Fires when | First action |
|---|---|---|
| `mosaik_groups_leader_is_local` is 1 on zero or ≥ 2 nodes of one instance for > 1 min | Split-brain or no leader | Incident response — split-brain |
| `mosaik_streams_consumer_subscribed_producers` drops to 0 on the aggregator | Clients disconnected | Check client-side logs for bootstrap failures |
| `zipnet_aggregates_forwarded_total` flat for > 3 × ZIPNET_ROUND_PERIOD | Aggregator stuck OR committee cannot open rounds | Incident response — stuck rounds |
| `zipnet_server_registry_size` < committee size for > 30 s | A committee server failed to publish | Check that server’s boot log |
| `mosaik_groups_committed_index` frozen | Raft stalled | Check clock skew, network partition |
Every trouble alert should be scoped by `instance` so multi-instance
hosts do not conflate a stuck testnet with a stuck production
committee.
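One of the conditions above, expressed as an alerting rule for illustration. The rule name, severity label, and window are assumptions; the `[2m]` window assumes `ZIPNET_ROUND_PERIOD` is on the order of 10 s, so that it comfortably covers 3 × ZIPNET_ROUND_PERIOD. Adjust it to your own round period:

```yaml
# Illustrative Prometheus alerting rule; names and durations are assumptions.
groups:
  - name: zipnet-trouble
    rules:
      - alert: ZipnetAggregatesStuck
        # Counter flat across the window: aggregator stuck or the
        # committee cannot open rounds. Scoped by instance automatically,
        # since the counter carries the instance label.
        expr: increase(zipnet_aggregates_forwarded_total[2m]) == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "No aggregates forwarded on {{ $labels.instance }}"
```

Because the alert fires per series, a stuck testnet instance pages separately from a stuck production one.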
## Recording rules for Prometheus

Useful derived series (all scoped by `instance`):
```
# Round cadence per instance
rate(zipnet_rounds_finalized_total[5m])

# Average participants per round per instance
  rate(zipnet_fold_participants_sum[5m])
/ rate(zipnet_fold_participants_count[5m])

# Aggregator fold saturation (clients dropped by the deadline)
(
  rate(zipnet_clients_registered_total[5m])
  -
  rate(zipnet_fold_participants_sum[5m]) / rate(zipnet_rounds_finalized_total[5m])
)
```
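To persist these as their own series, the same expressions drop into a recording-rules file. The rule names below follow the common `level:metric:operation` convention but are suggestions, not zipnet conventions:

```yaml
# Illustrative recording rules; the record names are assumptions.
groups:
  - name: zipnet-derived
    rules:
      # Round cadence per instance
      - record: instance:zipnet_round_cadence:rate5m
        expr: rate(zipnet_rounds_finalized_total[5m])
      # Average participants per folded round per instance
      - record: instance:zipnet_fold_participants:avg5m
        expr: >
          rate(zipnet_fold_participants_sum[5m])
          / rate(zipnet_fold_participants_count[5m])
```

Recorded series keep the `instance` label, so dashboards and alerts built on them stay scoped per instance for free.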
## Logs that should never fire (without a concurrent alert)
- `rival group leader detected` on any committee server.
- `SubmitAggregate with bad length` / `SubmitPartial with bad length` in a committee log.
- `failed to mirror LiveRoundCell` persistently.
- `committee offline — aggregate dropped` — either the committee is down or bundle tickets never replicated.
If any of these fire without a concurrent incident, treat it as a protocol invariant break and escalate to the contributor on-call.