Distributed Harness and Agent Orchestration
Spec metadata:
- ID: distributed-harness-orchestration
- Status: planning
- Hard depends on: multi-agent-support, signet-runtime
- Registry: docs/specs/INDEX.md
1) Problem
Signet currently runs as a single daemon per machine, serving one
~/.agents/ workspace. Multi-agent support (migration 043) added
agent_id scoping within that single daemon, but the architecture
assumes co-location: all agents share one SQLite database on one host.
Real deployments need more. A Mac Studio running three agents needs to coordinate with a Linux server running two more, each machine with its own daemon instance. Today there is no discovery protocol, no cross-daemon memory routing, and no unified control plane. Operators must configure each machine by hand.
2) Goals
- Define a control plane protocol for daemon-to-daemon discovery and health.
- Route memory queries across daemon instances by agent_id ownership.
- Preserve the agent_id scoping invariant across all cross-daemon operations.
- Support heterogeneous deployments (mixed OS, mixed daemon versions).
- Degrade gracefully when remote daemons are unreachable.
3) Proposed capability set
A) Daemon discovery and registration
Each daemon advertises itself to a designated coordinator (primary daemon)
via periodic heartbeat on a new /api/cluster/heartbeat endpoint. The
heartbeat includes: daemon version, host, port, agent roster, and
capabilities (pipeline enabled, graph enabled, etc.). The coordinator
maintains an in-memory cluster_peers registry with TTL-based expiry.
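A minimal sketch of the heartbeat payload and the TTL-based registry described above. The field names, the host:port registry key, and the ClusterPeers class are assumptions made for illustration; the spec does not fix a wire format.

```typescript
// Hypothetical heartbeat payload; field names are assumptions derived from
// the list above (version, host, port, agent roster, capabilities).
interface Heartbeat {
  daemonVersion: string;
  host: string;
  port: number;
  agents: string[]; // agent_id values hosted by the sending daemon
  capabilities: { pipeline: boolean; graph: boolean };
}

interface PeerEntry {
  heartbeat: Heartbeat;
  lastSeenMs: number;
}

// In-memory registry with TTL-based expiry. Time is passed in explicitly so
// expiry can be exercised without waiting on a real clock.
class ClusterPeers {
  private peers = new Map<string, PeerEntry>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Record or refresh a peer, keyed by host:port (an assumed key choice).
  record(hb: Heartbeat, nowMs: number): void {
    this.peers.set(`${hb.host}:${hb.port}`, { heartbeat: hb, lastSeenMs: nowMs });
  }

  // Drop peers whose last heartbeat is older than the TTL, then return the rest.
  live(nowMs: number): Heartbeat[] {
    this.peers.forEach((entry, key) => {
      if (nowMs - entry.lastSeenMs > this.ttlMs) this.peers.delete(key);
    });
    return Array.from(this.peers.values()).map((e) => e.heartbeat);
  }
}
```

Expiring lazily on read keeps the sketch free of timers; a real coordinator might instead sweep on an interval.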
B) Agent ownership routing
The coordinator maps agent_id to owning daemon. When a memory query arrives for an agent hosted on a remote daemon, the coordinator proxies the request via HTTP. The routing table is derived from heartbeat data. Local agents are served directly; remote agents incur one proxy hop.
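The routing decision above could be sketched as follows, assuming the coordinator holds one roster per peer taken from its latest heartbeat. The names PeerRoster, buildRoutingTable, and resolveRoute, the http:// base URL, and the first-claim-wins rule for duplicate agent_id values are all illustrative assumptions, not part of the spec.

```typescript
// Hypothetical per-peer roster, as extracted from a heartbeat.
interface PeerRoster {
  host: string;
  port: number;
  agents: string[];
}

type Route =
  | { kind: "local" }
  | { kind: "remote"; baseUrl: string }
  | { kind: "unknown" };

// Derive agent_id -> owning daemon base URL from the peer rosters.
function buildRoutingTable(peers: PeerRoster[]): Map<string, string> {
  const table = new Map<string, string>();
  peers.forEach((p) => {
    p.agents.forEach((agentId) => {
      // Assumption: the first daemon to claim an agent_id keeps it.
      // Collision resolution is listed as Phase 3 work.
      if (!table.has(agentId)) table.set(agentId, `http://${p.host}:${p.port}`);
    });
  });
  return table;
}

// Local agents are served directly; remote agents cost one proxy hop.
function resolveRoute(
  agentId: string,
  localAgents: Set<string>,
  table: Map<string, string>,
): Route {
  if (localAgents.has(agentId)) return { kind: "local" };
  const baseUrl = table.get(agentId);
  return baseUrl === undefined ? { kind: "unknown" } : { kind: "remote", baseUrl };
}
```

Returning an explicit "unknown" route lets the caller distinguish an unregistered agent from an unreachable daemon.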
C) Cross-daemon memory federation
Read operations (recall, search, expand) can fan out to multiple daemons
when the query spans agents with different visibility policies (shared,
group). Write operations always route to the owning daemon. Federation
respects the agent read policy from multi-agent-support (isolated agents
never fan out).
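One way to picture the fan-out rule is a function that picks the daemons a read should touch. The semantics below are assumptions, since the actual read policy lives in multi-agent-support: a shared agent is readable by any non-isolated requester, a group agent only by requesters carrying the same group label, and an isolated agent neither fans out nor is included. AgentInfo, readTargets, and writeTarget are hypothetical names.

```typescript
type Visibility = "isolated" | "shared" | "group";

// Hypothetical roster entry; `group` is an assumed label used only for "group".
interface AgentInfo {
  agentId: string;
  daemon: string; // owning daemon, e.g. "host:port"
  visibility: Visibility;
  group?: string;
}

// Daemons a read for `requester` should be sent to, starting with its owner.
function readTargets(requester: AgentInfo, roster: AgentInfo[]): string[] {
  const daemons = new Set<string>([requester.daemon]);
  // Isolated requesters never fan out.
  if (requester.visibility === "isolated") return Array.from(daemons);
  roster.forEach((a) => {
    if (a.agentId === requester.agentId) return;
    const readable =
      a.visibility === "shared" ||
      (a.visibility === "group" && a.group !== undefined && a.group === requester.group);
    if (readable) daemons.add(a.daemon);
  });
  return Array.from(daemons);
}

// Writes always go to the single owning daemon.
function writeTarget(owner: AgentInfo): string {
  return owner.daemon;
}
```

Because the result is a set of daemons rather than agents, each remote daemon still has to apply agent_id scoping to the query it receives.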
D) Control plane CLI
signet cluster status shows all known daemons, their agents, and health.
signet cluster join <host:port> registers a remote daemon with the
coordinator. signet cluster remove <host> deregisters. All commands
call the coordinator daemon’s HTTP API.
4) Non-goals
- No custom transport protocol (HTTP over existing Hono/axum stack).
- No automatic agent migration between daemons.
- No shared SQLite across network (each daemon owns its database).
- No cloud orchestration or container scheduling.
5) Integration contracts
Distributed Orchestration <-> Multi-Agent Support
- The agent_id scoping from multi-agent-support is the routing key for federation.
- Agent visibility (isolated/shared/group) determines fan-out eligibility.
- The cluster roster is additive to the per-daemon agent roster.
Distributed Orchestration <-> Signet Runtime
- Runtime session start resolves the target daemon via the routing table.
- Runtime API calls are daemon-addressed, not cluster-addressed (the coordinator resolves once at session start).
Distributed Orchestration <-> Rust Daemon Parity
- The cluster protocol must work identically on JS and Rust daemons.
- Version negotiation in heartbeat enables mixed-version clusters.
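The spec does not yet say how version negotiation works. One common approach, shown here purely as an assumption, is for each daemon to advertise the range of cluster protocol versions it supports and for both sides to settle on the highest version they share. ProtocolRange and negotiate are hypothetical names, and an integer protocol version separate from the daemon version is itself an assumption.

```typescript
// Hypothetical range of cluster protocol versions a daemon can speak.
interface ProtocolRange {
  min: number;
  max: number;
}

// Highest protocol version both sides support, or null if the ranges
// do not overlap (the peer would then be rejected or marked incompatible).
function negotiate(a: ProtocolRange, b: ProtocolRange): number | null {
  const lo = Math.max(a.min, b.min);
  const hi = Math.min(a.max, b.max);
  return lo <= hi ? hi : null;
}
```

For example, a JS daemon supporting versions 1 to 3 and a Rust daemon supporting 2 to 5 would settle on version 3 under this scheme.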
6) Rollout phases
Phase 1 (single-coordinator, read-only federation)
- Heartbeat endpoint and peer registry on coordinator daemon.
- CLI signet cluster status and signet cluster join.
- Cross-daemon recall proxy for shared-visibility agents.
- No write federation (writes must target the owning daemon directly).
Phase 2 (full federation)
- Write routing to owning daemon via coordinator proxy.
- Fan-out search across multiple daemons with result merging.
- Health-aware routing (skip unreachable daemons, surface degradation).
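The fan-out, merge, and degradation items in Phase 2 could fit together roughly as below. This is a sketch under assumptions: the Hit shape, the assumption that scores are comparable across daemons, and the injected query function are all hypothetical, and the real proxy call is left to the caller.

```typescript
// Hypothetical search hit; assumes scores are comparable across daemons.
interface Hit {
  memoryId: string;
  agentId: string;
  score: number;
}

type Outcome =
  | { daemon: string; ok: true; hits: Hit[] }
  | { daemon: string; ok: false };

interface Merged {
  hits: Hit[];
  degraded: string[]; // daemons that failed and were skipped
}

// Merge per-daemon outcomes: keep the top `limit` hits by score and
// report unreachable daemons instead of failing the whole query.
function mergeOutcomes(outcomes: Outcome[], limit: number): Merged {
  const hits: Hit[] = [];
  const degraded: string[] = [];
  outcomes.forEach((o) => {
    if (o.ok) hits.push(...o.hits);
    else degraded.push(o.daemon);
  });
  hits.sort((a, b) => b.score - a.score);
  return { hits: hits.slice(0, limit), degraded };
}

// Query every daemon in parallel; a rejected query becomes a failed
// outcome rather than rejecting the combined promise.
async function fanOut(
  daemons: string[],
  query: (daemon: string) => Promise<Hit[]>,
  limit: number,
): Promise<Merged> {
  const outcomes = await Promise.all(
    daemons.map((daemon) =>
      query(daemon).then(
        (hits): Outcome => ({ daemon, ok: true, hits }),
        (): Outcome => ({ daemon, ok: false }),
      ),
    ),
  );
  return mergeOutcomes(outcomes, limit);
}
```

Surfacing the degraded list alongside the hits gives the CLI and dashboard something concrete to show when results are partial.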
Phase 3 (resilience)
- Coordinator failover (any daemon can become coordinator).
- Conflict resolution for agent_id collisions across daemons.
- Metrics and dashboard visualization of cluster topology.
7) Validation and tests
- Heartbeat TTL expiry removes stale peers from the registry.
- Recall for a remote agent returns results from the owning daemon.
- Isolated agents never appear in cross-daemon fan-out results.
- Every federated query carries an agent_id (scoping invariant).
- Coordinator proxy adds < 5ms overhead on localhost.
- Graceful degradation: a query that spans an unreachable daemon returns partial results, not an error.
8) Success metrics
- Multi-machine deployment operates from a single CLI entry point.
- Cross-daemon recall latency stays under 50ms on LAN.
- Zero data collisions across agents on different daemons.
- Cluster survives single-daemon failure without data loss.
9) Open decisions
- Whether the coordinator is a designated daemon or an elected role.
- Authentication model for inter-daemon communication (shared secret vs mTLS).
- Whether to support WAN deployments or restrict to LAN-only in v1.