
Troubleshooting from the user side

audience: users

Failure modes you can observe from your own agent, mapped to the SDK’s error enum and the fastest check for each.

The error enum

pub enum Error {
    WrongUniverse { expected: mosaik::NetworkId, actual: mosaik::NetworkId },
    ConnectTimeout,
    Attestation(String),
    Shutdown,
    Protocol(String),
}

Five variants. The two you will hit most in development are ConnectTimeout and WrongUniverse. Everything else is either a real runtime condition or lower-level plumbing surfaced through Protocol.
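The variant-by-variant advice below can be condensed into a single triage match. This is an illustrative sketch, not SDK code: the enum is reproduced here so the snippet stands alone, with mosaik::NetworkId simplified to u64, and the advice strings just summarize the sections that follow on this page.

```rust
// Sketch: a triage table over the error enum above. NetworkId is
// simplified to u64 here so the snippet compiles standalone.
#[derive(Debug)]
enum Error {
    WrongUniverse { expected: u64, actual: u64 },
    ConnectTimeout,
    Attestation(String),
    Shutdown,
    Protocol(String),
}

fn triage(err: &Error) -> &'static str {
    match err {
        Error::ConnectTimeout => {
            "verify config fingerprint, operator status, bootstrap peers, TDX posture"
        }
        Error::WrongUniverse { .. } => "rebuild the Network against zipnet::UNIVERSE",
        Error::Attestation(_) => "check TDX guest, MR_TD, and quote freshness",
        Error::Shutdown => "a handle or the Network was dropped; expected during teardown",
        Error::Protocol(_) => "enable verbose logging; do not pattern-match the string",
    }
}
```

Each arm points at the matching "Symptom" section below; the strings are summaries, not machine-readable codes.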

Symptom: a Zipnet::<D>::* constructor returns ConnectTimeout

This is the single most common dev-time error. It means the SDK could not bond to a peer serving your deployment within the connect deadline. In descending order of likelihood:

1. Config mismatch with the operator

Every field in your Config — name, window, init — plus your datum’s TYPE_TAG and WIRE_SIZE folds into the deployment’s on-wire identity. A one-character difference in the name, a different ShuffleWindow preset, or a stale init salt produces a completely different id and nobody is serving it.

Fix: double-check every field against the operator’s handoff. Prefer pinning the Config as a compile-time constant so the fingerprint is immutable at source:

use zipnet::{Config, ShuffleWindow, Zipnet};

const ACME_MAINNET: Config = Config::new("acme.mainnet")
    .with_window(ShuffleWindow::interactive())
    .with_init([
        0x7f, 0x3a, 0x9b, 0x1c, /* … operator-published bytes … */ 0x00,
    ]);

// Print the derived id on both sides to confirm agreement.
println!("{}", Zipnet::<Note>::deployment_id(&ACME_MAINNET));

2. Operator’s committee isn’t up

The fingerprint is right, but nobody is currently serving it. The SDK cannot distinguish “nobody serves this” from “operator isn’t up yet” without an on-network registry — both surface as ConnectTimeout.

Fix: ask the operator whether the deployment is live.

3. Bootstrap peers unreachable

Even if the instance name is right and the committee is up, your network never bonded to the universe — so it never found the committee. Usually shows up alongside no peer-catalog growth.

Fix: check the bootstrap peer list. See Connecting — Cold-start checklist.

4. TDX posture mismatch

Silent rejection at the bond layer from a TDX-gated deployment often looks like ConnectTimeout rather than a clear Attestation error. Common when your client is built without the tee-tdx feature against a TDX-gated operator.

Fix: see TEE-gated deployments.

Symptom: a Zipnet::<D>::* constructor returns WrongUniverse

Your Arc<Network> was built against a different NetworkId than zipnet::UNIVERSE. The error payload tells you both values:

match zipnet::Zipnet::<Note>::submit(&network, &ACME_MAINNET).await {
    Err(zipnet::Error::WrongUniverse { expected, actual }) => {
        tracing::error!(%expected, %actual, "network on wrong universe");
    }
    _ => {}
}

Fix: build the network with Network::new(UNIVERSE) or Network::builder(UNIVERSE). There is no way to tunnel zipnet over a non-universe network.

Symptom: a Zipnet::<D>::* constructor returns Attestation

TDX attestation failed. The string payload names the specific failure from the mosaik TDX stack.

Common causes:

  • You built with tee-tdx but aren’t running inside a TDX guest.
  • Your MR_TD differs from the operator’s expected value (fresh image you haven’t rebuilt, or operator rotated).
  • Your quote has expired.

See TEE-gated deployments.

Symptom: the Reader<D> never yields a note you submitted

Three common causes, in rough frequency order.

1. Slot collision

Another client in the same round hashed to the same slot you did. Both payloads XOR-corrupt each other; neither lands. Retry on the next round — see Publishing — Retry policy.

Persistent collisions mean the deployment is oversubscribed for its num_slots (an operator-side tuning concern). Collision probability per pair per round is 1 / num_slots; for N clients the expected number of collisions per round is C(N, 2) / num_slots.
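The arithmetic above is worth making concrete. A minimal helper, purely illustrative (num_slots is operator-side tuning, as noted):

```rust
/// Expected number of pairwise slot collisions per round:
/// C(n, 2) / num_slots, where n is the number of clients submitting.
fn expected_collisions(n: u64, num_slots: u64) -> f64 {
    (n * (n - 1) / 2) as f64 / num_slots as f64
}
```

For example, 10 clients over 45 slots gives C(10, 2) = 45 pairs, so one expected collision per round; 100 clients over 4096 slots gives 4950 / 4096, a little more than one.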

2. Aggregator dropped your envelope

The aggregator never forwarded your envelope into a committed aggregate. Usually transient:

  • Aggregator was offline that round.
  • Your registration hadn’t propagated yet (first few seconds after opening the submitter).

Retry. Repeated failures across many rounds mean the aggregator is unreachable from you — check the peer catalog and bootstrap peers, then contact the operator.

3. Your client isn’t in the live-round roster

The SDK’s driver re-sends its ClientBundle nudge every round it finds itself outside the roster. First-time admission can take one or two rounds even under ideal conditions.

v1 note. The SDK does not currently surface which of these three outcomes happened — the ECIES receipts stream that would tell you is deferred to v2. Distinguishing the three requires application-level retries with byte-equality checks on the Reader<D> stream.
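Until v2 receipts land, that application-level check has to be built by hand. A shape-only sketch follows; the two closures are hypothetical stand-ins (nothing here is zipnet API): submit re-sends the payload for a round, and observed reports whether a byte-equal copy came back on the Reader<D> stream that round.

```rust
/// Retry until a byte-equal copy of `payload` is observed, or give up
/// after `max_rounds`. `submit` and `observed` are hypothetical hooks
/// standing in for the submitter and the Reader<D> scan.
fn confirm_delivery(
    payload: &[u8],
    max_rounds: usize,
    mut submit: impl FnMut(&[u8]),
    mut observed: impl FnMut(&[u8]) -> bool,
) -> bool {
    for _ in 0..max_rounds {
        submit(payload);
        if observed(payload) {
            return true; // a byte-equal copy landed this round
        }
        // Otherwise: collision, drop, or roster lag. v1 cannot tell you which.
    }
    false
}
```

Byte equality is the only signal available: a slot collision XOR-corrupts both payloads, so anything other than an exact copy counts as a miss.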

Symptom: the reader sees no values for a long time

Two possibilities:

1. The committee is stuck

The cluster is not finalizing rounds. Contact the operator.

2. Your handle hasn’t caught up yet

The Zipnet::<D>::* constructors wait for the first live round roster before returning, so once you hold a Reader<D> values should start at the next round boundary. If they do not, you are not reaching the broadcast collection’s group — same checks as for ConnectTimeout (config, bootstrap, UDP egress, TDX).

Symptom: a send or the stream's next read returns Shutdown


The handle is closing. Either you dropped every clone of the submitter, dropped the reader, or the underlying Network went down.

Fix: check that the Arc<Network> is still alive and that no other part of your code dropped the handle. If this is intentional, the error is just the post-drop signal.

Symptom: Error::Protocol(…) with an opaque string

The SDK bubbled up a lower-level mosaik or zipnet-protocol failure. The string content is for humans — do not pattern-match on it.

Fix: enable verbose logging and inspect the mosaik-layer event stream:

RUST_LOG=info,zipnet=debug,mosaik=info cargo run

If the root cause is in mosaik, the mosaik book has better diagnostics than this page can offer. Open a zipnet issue with the log excerpt if the failure looks zipnet-specific.

Symptom: reader lags and misses values

Your per-item handler is slower than the deployment’s round cadence. Internal broadcast channels drop items rather than stall the SDK.

Fix: offload heavy per-item work to a separate task. See Reading — Handling a slow consumer.
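One standard-library shape for that offload is a worker thread behind a bounded channel. This is a sketch, not SDK code: process stands in for your heavy handler, and the read loop only does a try_send, so it drains the stream promptly.

```rust
use std::sync::mpsc;
use std::thread;

/// Sketch: drain `items` quickly by handing each one to a worker thread.
/// Returns (items handled, items dropped under overload).
fn offload<F>(items: Vec<Vec<u8>>, process: F) -> (usize, usize)
where
    F: Fn(&[u8]) + Send + 'static,
{
    // Bounded channel: memory stays bounded if the worker falls behind.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(1024);

    let worker = thread::spawn(move || {
        let mut handled = 0;
        for item in rx {
            process(&item); // the slow work happens off the read loop
            handled += 1;
        }
        handled
    });

    let mut dropped = 0;
    for item in items {
        // try_send drops under sustained overload instead of stalling the
        // reader, mirroring the SDK's own drop-rather-than-stall policy.
        if tx.try_send(item).is_err() {
            dropped += 1;
        }
    }
    drop(tx); // close the channel so the worker exits
    (worker.join().unwrap(), dropped)
}
```

In a real client the loop body would be your Reader<D> read loop; the key property is that the hot path does nothing slower than a non-blocking send.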

Symptom: your client compiled against one version, the operator upgraded

Mosaik is pinned to =0.3.17 on both sides, and the zipnet and zipnet-proto baselines must also match the deployment. If WIRE_VERSION or round-parameter defaults change, your client derives a different deployment id and the Zipnet::<D>::* constructors return ConnectTimeout.

Fix: keep your zipnet dep version aligned with the operator’s release notes. Mosaik stays pinned.
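In Cargo.toml terms, that policy looks roughly like the fragment below. The mosaik pin is the one stated above; the zipnet version shown is hypothetical and should come from the operator's release notes.

```toml
[dependencies]
# Exact pin, required on both sides of the deployment.
mosaik = "=0.3.17"
# Hypothetical version: take the actual baseline from the operator's release notes.
zipnet = "0.4"
```

The = prefix makes Cargo refuse any other mosaik release, which is what keeps the two sides' wire identities from silently diverging on a patch bump.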

When to escalate to the operator

  • A Zipnet::<D>::* constructor consistently fails with ConnectTimeout after the config, bootstrap, and universe have all been verified.
  • The reader never yields notes you submitted, even after several round periods of retry.
  • Your reader stays open but sees no values over several round periods.

When you escalate, include:

  • Your mosaik version (=0.3.17) and zipnet SDK version.
  • The full Config (name, window, init) and the Zipnet::<D>::deployment_id(&CONFIG) you derive locally.
  • Whether you built with tee-tdx and, if so, your client’s MR_TD.
  • A 60-second log excerpt at RUST_LOG=info,zipnet=debug,mosaik=info.