Metrics reference

audience: operators

Every zipnet binary exposes a Prometheus endpoint when ZIPNET_METRICS is set. The table below lists the metrics worth scraping in production. Metrics starting with mosaik_ are emitted by the underlying mosaik library and documented in the mosaik book — Metrics; the ones that are load-bearing for zipnet operations are listed here.

Metrics that are instance-scoped carry an instance label whose value is the operator’s ZIPNET_INSTANCE string (e.g. acme.mainnet). When a host multiplexes several instances (see Operator quickstart — running many instances), every instance-scoped metric is emitted once per instance.
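For example, on a host running several instances, a query can select one instance's copy of a series by filtering on the label (the instance name below is the illustrative value from above, not a required name):

```promql
# Round counter for a single instance on a multi-instance host
zipnet_rounds_finalized_total{instance="acme.mainnet"}
```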

Per-role metrics

Committee server

| Metric | Kind | Meaning | Healthy value |
|---|---|---|---|
| `mosaik_groups_leader_is_local{instance=<name>}` | gauge (0/1) | Whether this node is the Raft leader for the instance | Exactly one 1 across the committee of each instance |
| `mosaik_groups_bonds{peer=<id>,instance=<name>}` | gauge (0/1) | Whether a bond to a specific peer is healthy | 1 for every other committee member of the same instance |
| `mosaik_groups_committed_index{instance=<name>}` | gauge | Highest committed Raft index | Monotonically increasing, step ≈ 2 per round |
| `zipnet_rounds_finalized_total{instance=<name>}` | counter | Rounds this node saw finalize | Increases at ~1 / ZIPNET_ROUND_PERIOD |
| `zipnet_partials_submitted_total{instance=<name>}` | counter | Partials this node contributed | Increases by 1 per round |
| `zipnet_client_registry_size{instance=<name>}` | gauge | Clients currently registered | Roughly equal to the expected client count |
| `zipnet_server_registry_size{instance=<name>}` | gauge | Servers currently registered | Equals committee size |

The mosaik_groups_leader_is_local gauge is the one the operator quickstart tells you to check when bringing a new instance up — exactly one committee node should report 1 per instance.
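One way to turn that check into a query — a sketch, not a shipped alert rule — is to sum the gauge per instance and compare against 1:

```promql
# Healthy: the sum is exactly 1 for every instance.
# Any other value means no leader (0) or split-brain (>= 2).
sum by (instance) (mosaik_groups_leader_is_local) != 1
```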

Aggregator

| Metric | Kind | Meaning | Healthy value |
|---|---|---|---|
| `mosaik_streams_consumer_subscribed_producers{stream=<id>,instance=<name>}` | gauge | Number of producers this consumer is attached to | = client count for ClientToAggregator |
| `mosaik_streams_producer_subscribed_consumers{stream=<id>,instance=<name>}` | gauge | Number of consumers attached to this producer | = committee size for AggregateToServers |
| `zipnet_aggregates_forwarded_total{instance=<name>}` | counter | Aggregates sent to the committee | ≈ rounds finalized |
| `zipnet_fold_participants{round=<r>,instance=<name>}` | histogram | Clients per folded round | Depends on your client count |
| `zipnet_clients_registered_total{instance=<name>}` | counter | Client bundles mirrored into ClientRegistry | Grows to client count, then plateaus |

Client

| Metric | Kind | Meaning | Healthy value |
|---|---|---|---|
| `zipnet_envelopes_sent_total{instance=<name>}` | counter | Envelopes sealed and pushed | Increases by 1 per talk round |
| `zipnet_envelope_send_errors_total{instance=<name>}` | counter | Envelope send failures | Ideally 0 |
| `zipnet_client_registered{instance=<name>}` | gauge (0/1) | Whether our bundle is in ClientRegistry | 1 after the first few seconds |
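A derived series worth watching on the client side — a sketch built only from the counters above — is the envelope send error ratio:

```promql
# Fraction of envelope sends that fail; ideally 0
  rate(zipnet_envelope_send_errors_total[5m])
/ rate(zipnet_envelopes_sent_total[5m])
```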

Metrics that indicate trouble

| Metric | Fires when | First action |
|---|---|---|
| `mosaik_groups_leader_is_local` | 1 on zero or ≥ 2 nodes of one instance for > 1 min (split-brain or no leader) | Incident response — split-brain |
| `mosaik_streams_consumer_subscribed_producers` | Drops to 0 on the aggregator (clients disconnected) | Check client-side logs for bootstrap failures |
| `zipnet_aggregates_forwarded_total` | Flat for > 3 × ZIPNET_ROUND_PERIOD (aggregator stuck or committee cannot open rounds) | Incident response — stuck rounds |
| `zipnet_server_registry_size` | < committee_size for > 30 s (a committee server failed to publish) | Check that server’s boot log |
| `mosaik_groups_committed_index` | Frozen (Raft stalled) | Check clock skew, network partition |

Every trouble alert should be scoped by instance so multi-instance hosts do not conflate a stuck testnet with a stuck production committee.
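In PromQL this means aggregating with `by (instance)` rather than across all series, so the alert fires once per affected instance. A sketch of a per-instance stuck-rounds expression (the window is illustrative, not a recommended threshold):

```promql
# Fires separately for each instance whose rounds have stalled
sum by (instance) (rate(zipnet_rounds_finalized_total[15m])) == 0
```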

Recording rules for Prometheus

Useful derived series (all scoped by instance):

```promql
# Round cadence per instance
rate(zipnet_rounds_finalized_total[5m])

# Average participants per round per instance
  rate(zipnet_fold_participants_sum[5m])
/ rate(zipnet_fold_participants_count[5m])

# Aggregator fold saturation (clients dropped by the deadline)
(
  rate(zipnet_clients_registered_total[5m])
  -
  rate(zipnet_fold_participants_sum[5m]) / rate(zipnet_rounds_finalized_total[5m])
)
```
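Because `zipnet_fold_participants` is a histogram, Prometheus can also derive percentiles from its buckets. A sketch (the accuracy depends on the bucket boundaries zipnet configures, which are not documented here):

```promql
# Approximate 90th percentile of clients per folded round, per instance
histogram_quantile(
  0.9,
  sum by (le, instance) (rate(zipnet_fold_participants_bucket[5m]))
)
```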

Logs that should never fire (without a concurrent alert)

- `rival group leader detected` on any committee server.
- `SubmitAggregate with bad length` / `SubmitPartial with bad length` in a committee log.
- `failed to mirror LiveRoundCell persistently`.
- `committee offline — aggregate dropped` — either the committee is down or bundle tickets never replicated.

If any of these fire without a concurrent incident, treat it as a protocol invariant break and escalate to the contributor on-call.