title: “SSM Integration Research”

State-Space Models for Signet: Research Synthesis

This document consolidates findings from twelve parallel research streams — four codebase explorations and eight web research surveys — into a unified picture of how state-space models (SSMs) could transform Signet’s memory system from a collection of heuristics into a learned temporal engine. It includes hard architectural constraints discovered through research, a validation contract framework with concrete test specifications, and a synthetic data pretraining strategy for immediate validation before any production code is written.

Companion documents:

Literature Review — foundational papers, small/efficient architectures, memory-specific applications, training on personal data
Implementation Survey — reference implementations, production deployments, Rust ecosystem, deployment patterns
Novel Applications — creative and underexplored SSM applications: causal discovery, world models, anomaly detection, 1.58-bit quantization, identity coherence
Graph Intersection — SSMs for temporal knowledge graphs, graph neural network integration, hierarchical multi-scale reasoning, entity salience modeling
Continual Learning Deep Dive — test-time training, catastrophic forgetting, per-user adapters, federated learning, sleep-inspired consolidation
Synthetic Data Generation — synthetic benchmark design, canary patterns, curriculum learning, data augmentation, TSTR validation protocol

title: “SSM Integration Research”

The Thesis

VISION.md describes Signet’s endgame:

“Signet doesn’t just remember — it learns what to remember. A neural network unique to each user, trained on their own interaction patterns, that gets sharper the longer you use it.”

The current pipeline approximates this with a grab bag of heuristics: FTS5 keyword search, cosine similarity, exponential decay (importance * 0.95^ageDays), hardcoded antonym pairs, fixed traversal limits, and a cross-attention scorer sidecar with three critical bugs. Each stage was built independently because no single model could hold state across time.

An SSM is that model. State-space models maintain a compressed hidden state that evolves with each input, learning what to retain and what to forget. They process sequences in linear time with constant memory. They handle irregular temporal sampling natively. And at 2-10M parameters, they fit inside a Rust sidecar binary under 15MB.

The proposal is not “add SSMs in a few places.” It is: SSMs become the learned backbone that every pipeline stage consults. The graph stays. The database stays. FTS5 stays. But the temporal reasoning — what’s important now, what will be important next, what should decay, what contradicts what — moves from hardcoded rules to a model that learns from the user’s own interaction patterns.

title: “SSM Integration Research”

Current Pipeline: The Heuristic Map

The memory pipeline spans ~19,800 LOC across 52+ files in packages/daemon/src/pipeline/. Here is what each stage does today and where it approximates temporal dynamics with static rules:

Significance Gate (`significance-gate.ts`)

What it does: Decides whether a session is worth extracting
Heuristics: Turn count regex, substring entity matching, token-set novelty comparison against last 5 transcripts
Hardcoded thresholds: minTurns=4, minEntityOverlap=2, noveltyThreshold=0.4
Problem: lowerTranscript.includes(row.name.toLowerCase()) fails on postgres/PostgreSQL, plurals, abbreviations. Same thresholds for all users regardless of communication style.

Extraction (`extraction.ts`)

What it does: LLM call to atomize transcript into facts + entities
Heuristics: Regex-based JSON parsing with fence stripping, trailing comma fixing, balanced-brace fallback
Hardcoded limits: MAX_FACTS=20, MAX_ENTITIES=15, MAX_FACT_LENGTH=2000
Problem: No learning from extraction failures. Every session repeats the same brittle parsing. No confidence calibration.

Decision Engine (`decision.ts`)

What it does: For each fact, searches existing memories and decides ADD/UPDATE/DELETE/SKIP via LLM
Heuristics: Alpha-blended BM25 + vector (0.7/0.3), top-5 candidates
Problem: Binary search cutoff (CANDIDATE_LIMIT=5). No temporal reasoning about how technology decisions evolve. No learning from user’s implicit decision preferences.

Contradiction Detection (`contradiction.ts`)

What it does: Fast path (syntactic heuristics) + slow path (LLM)
Heuristics: 34 hardcoded antonym pairs, temporal marker regex, negation polarity detection
Problem: Misses domain-specific contradictions (“REST API” vs “GraphQL endpoint”). No context-aware contradiction (same words may or may not contradict depending on which entity they describe).

Importance Scoring (`hooks.ts:effectiveScore`)

What it does: Ranks memories for injection

The entire function:

if (pinned) return 1.0;
const ageDays = (Date.now() - new Date(createdAt).getTime())
  / (1000 * 60 * 60 * 24);
return importance * 0.95 ** ageDays;

Problem: Single hardcoded exponential. No access frequency, no structural relationships, no temporal context, no entity-awareness. Half-life is ~13.5 days regardless of memory type or user behavior.

Graph Traversal (`graph-traversal.ts`)

What it does: Walks entity -> aspect -> attribute -> dependency chains to construct context
Heuristics: maxAspectsPerEntity=10, maxAttributesPerAspect=20, maxDependencyHops=10, timeoutMs=500, densityScore weights (0.15, 0.05, 0.2)
Problem: Query-agnostic expansion. “What does X depend on?” and “What are X’s properties?” trigger the same traversal. No memory of which aspects were hot in prior turns.

Session Hook Flow (`hooks.ts:handleUserPromptSubmit`)

What it does: On every prompt, runs embedding + hybrid search + dedup + top-5 selection
Heuristics: 5-turn sliding window dedup, 500-char budget
Problem: Stateless per-turn. No session trajectory. Each search is independent. On 30+ turn sessions, embedding cost alone is 1-2s cumulative. No prediction of what will be needed next.

Predictive Scorer (Rust sidecar, `predictor/`)

What it does: Cross-attention scorer that ranks candidate memories using 768-dim embeddings + 17-dim feature vectors
Architecture: CrossAttentionScorer with 7 learned parameter groups, ~1.1M parameters
Critical bugs: (1) Feature vectors 4-element but sidecar expects 17 (silent failure); (2) Cold start exits on training pair count instead of session count; (3) Stale traversal cache never invalidated.
Problem even without bugs: Static feature engineering. Adding new temporal signals requires schema changes + retraining. The 17-dim vector is a hand-built approximation of what an SSM learns natively.

Retention (`retention-worker.ts`)

What it does: Purges old data on fixed schedules
Heuristics: Hard-delete after 30 days, history events after 180 days, completed jobs after 14 days
Problem: Fixed windows. No structural importance awareness. A memory linked to 50 entities decays at the same rate as an orphan.

Summarization (`summary-worker.ts`, `summary-condensation.ts`)

What it does: Session -> arc -> epoch hierarchical summarization
Heuristics: Fixed chunk size (20k chars), arc threshold (8 sessions), epoch threshold (4 arcs)
Problem: All sessions treated equally. No learned priority for what to preserve across condensation levels.

Continuity State (`continuity-state.ts`)

What it does: Accumulates per-session state (queries, remembers, focal entities)
Storage: Arrays of up to 20 queries, 10 remembers, 10 snippets. ~2KB per checkpoint.
Problem: No learned compression. All state equally weighted. No prediction of what matters for recovery.

title: “SSM Integration Research”

What the Literature Says

The Core Architecture: Mamba

The Mamba family (Gu & Dao, 2023-2026) provides the theoretical and practical foundation. Key properties:

Selective state: Parameters (A, B, C, delta) are input-dependent, allowing the model to choose what to remember vs forget per timestep
Linear time, constant memory: Processes sequences of any length with O(N) computation and O(1) memory per step
Hardware-aware: Parallel scan for training, recurrent mode for inference. No KV cache.
Mamba-3 (March 16, 2026 — 4 days ago): Inference-first design. Complex-valued state (models cyclical patterns like daily/weekly rhythms). MIMO formulation (better predictions at same memory cost). Achieves Mamba-2 quality with half the state size.

The Brain-Database Analogy

Albert Gu’s framing (July 2025): Transformers are databases (store everything, retrieve exactly). SSMs are brains (fixed-size memory, always processing, retain the shape and flow rather than every detail).

Signet already has the database (SQLite + FTS5 + vector search). What it lacks is the brain — the compressed temporal state that knows what matters without looking it up. SSMs provide this. The two complement each other.

Proven at Our Scale

IBM FlowState: 9.1M parameters, outperforms models 20x its size on temporal prediction. S5-based. Proves sub-10M SSMs are viable.
SS4Rec / SSD4Rec: SSMs predicting “what’s relevant next” from user interaction sequences. Direct analogy to memory relevance scoring.
DyGMamba: Dual-SSM for temporal knowledge graph link prediction. Directly applicable to entity relationship evolution.
RankMamba: Mamba-130m competitive with BERT on document ranking.
SleepGate (March 15, 2026): Three modules for conflict detection, selective forgetting, and consolidation. Maps directly to Signet’s contradiction detection, retention decay, and session summarization.

Inference Performance

4-layer Mamba at d64: 0.38ms per sample (10.8x faster than equivalent transformer)
Mamba-3 at 16K context: ~7x faster than Llama-3.2-1B
Sub-10M model quantized with heterogeneous INT8/INT4: <10MB, <1ms inference

Rust Ecosystem

Three Mamba implementations exist in Rust:

mamba.rs: Pure Rust, minimal deps, memory-mapped weights (Apache 2.0 / MIT)
mamba-ssm: Candle-based, tested on Apple Silicon (MIT)
oxidizr/blazr: Full training + inference stack, supports Mamba-2/3 (production-grade)

Plus web-rwkv for RWKV-7 inference in Rust with quantization. And llama.cpp supports Mamba via GGUF with custom SSM operators.

The Rust inference path is viable today. Train in PyTorch, export safetensors, infer in Rust.

Key Insight: Prediction, Not Retrieval

RecurrentGemma research shows retrieval depends exclusively on attention layers — SSMs show zero compensatory retrieval ability. But our use case is prediction (which memories will be relevant) not retrieval (exact lookup of a specific memory). For prediction tasks, pure SSMs excel. The database handles retrieval. The SSM handles prediction.

title: “SSM Integration Research”

The Integration Map

Where SSMs Replace Heuristics

Pipeline Stage	Current Heuristic	SSM Replacement	Impact
Significance gate	Turn count + substring match + token novelty	Learned significance scorer	~20% fewer false negatives/positives
Importance scoring	`importance * 0.95^ageDays`	Learned decay with data-dependent lambda	Adaptive per-memory, per-user decay
Contradiction detection	34 antonym pairs + temporal markers	Domain-learned contradiction classifier	~40% fewer false contradictions
Decision engine	LLM call for every fact	SSM pre-filter (skip LLM for high-confidence cases)	~30% fewer LLM calls
Traversal scoring	Fixed density weights	Learned aspect/path scoring biased by session trajectory	Query-adaptive traversal
Session context	Stateless per-turn search	SSM hidden state biases focal entity resolution	Proactive context, not reactive search
Retention windows	Fixed 30-day hard delete	Structural-importance-aware soft decay	Memories tied to high-mention entities persist
Summarization priority	All sessions equal	Learned importance weighting	Better signal retention across condensation
Reranking blend	Fixed 70/30 BM25/vector	User-adaptive learned blend	Personalized ranking
Continuity checkpoints	Raw arrays (~2KB)	SSM hidden state (~256 bytes)	8x checkpoint compression

Where SSMs Augment (Not Replace)

Hybrid search (BM25 + vector): Keep as-is. SSM biases what gets searched, not how search works.
Knowledge graph traversal: Keep the graph. SSM biases which paths to explore first.
Session tracking/bypass: Keep the mutex and cleanup. Orthogonal concern.
Embedding generation: Keep Nomic Embed. SSM predicts what to embed proactively.

Alignment with Spec Pipeline

Spec	Status	SSM Fit
Knowledge Architecture (KA-1 to KA-6)	COMPLETE	SSM traversal scoring builds on KA entity/aspect/attribute schema
Predictive Memory Scorer	COMPLETE (bugs)	SSM replaces cross-attention scorer entirely
Session Continuity Protocol	COMPLETE	SSM state replaces raw checkpoint arrays
Desire Paths (DP-1 to DP-10)	PARTIAL	Strongest fit. DP-9/10 (path feedback, path scoring) are inherently SSM problems
Procedural Memory (P1-P5)	NOT STARTED	SSM models skill decay and co-usage patterns
LCM Foundation Patterns	PLANNING	SSM enables temporal reinforcement (LCM pattern extension)
Retroactive Supersession	PLANNING	SSM contradiction classifier replaces heuristic detection

The desire paths epic is the highest-value integration point. Paths through the entity graph are sequential state transitions. Reinforcement comes from feedback. Temporal patterns emerge naturally. This is what SSMs were designed for.

title: “SSM Integration Research”

Proposed Architecture

The Signet Neural Backbone (SNB)

A single Mamba-based SSM sidecar that serves multiple roles through a unified hidden state, replacing the current cross-attention predictor.

Architecture:  Mamba-3 (MIMO, trapezoidal discretization, complex state)
Parameters:    5-10M (4-8 layers, d_model=128-256, N=32-64)
Precision:     Heterogeneous (A matrix fp16, projections int8)
Binary size:   ~10-15MB (Rust, quantized safetensors)
Inference:     <1ms per event, CPU-only
Training:      PyTorch -> safetensors -> Rust inference
State:         Persistent hidden state saved between sessions

Input Features Per Interaction Event

Temporal:
  - time_since_last_event (continuous, irregularly sampled)
  - time_of_day (sin/cos encoded)
  - day_of_week (sin/cos encoded)
  - session_age (time since session start)

Memory access:
  - focal_entity_ids (hashed to fixed-width slots)
  - aspect_ids_touched (hashed)
  - memory_ids_injected (hashed)
  - access_count_delta (how many times each memory was accessed this turn)

Feedback:
  - agent_relevance_signal (from MCP feedback tool, [-1, 1])
  - fts_hit_count (behavioral signal)
  - was_referenced_in_response (implicit positive signal)

Content:
  - query_embedding (768-dim, projected to d_model via learned projection)
  - entity_structural_density (from knowledge graph)

Output Heads (Multi-Task)

From a single forward pass, the SSM produces:

Memory relevance scores: Which memories to inject for next turn
Aspect bias weights: Which aspects to prioritize in traversal
Predicted next entities: For proactive embedding prefetch
Retention adjustment: Per-memory decay rate modifiers
Significance score: Whether current session is worth extracting
Contradiction probability: For each new fact vs existing memories

Data Flow

Session Turn Arrives
       |
       v
  [Embed query] ──────────────────────────────┐
       |                                       |
       v                                       v
  [SSM.step(hidden, features)]          [Hybrid Search]
       |                                       |
       |── aspect_bias ──> [Graph Traversal] <─┘
       |── relevance_scores ──> [Reranking]
       |── predicted_entities ──> [Prefetch Embeddings]
       |── significance ──> [Gate Extraction]
       |── retention_adj ──> [Decay Update]
       |── contradiction_prob ──> [Contradiction Check]
       |
       v
  [Inject Context to Agent]
       |
       v
  [Agent Responds]
       |
       v
  [Collect Feedback] ──> [SSM Training Signal]

The Continuity Loop

The SSM hidden state persists across turns within a session and across sessions via checkpoint serialization:

Session 1, Turn 1: h_0 = init_from_project_defaults()
Session 1, Turn N: h_N = SSM.step(h_{N-1}, features_N)
Session 1, End:    checkpoint(h_N) -> 256 bytes

Session 2, Start:  h_0 = restore_from_checkpoint()
Session 2, Turn 1: h_1 = SSM.step(h_0, features_1)
...

This is the missing continuity mechanism. Current checkpoints store raw arrays of entity IDs and query strings. The SSM state is a learned compression of what actually matters for predicting next-turn relevance.

Graceful Degradation

The SSM is advisory, not authoritative:

SSM unavailable: Skip all SSM-derived biases, use baseline traversal and scoring. No regression.
SSM prediction wrong: Biased aspects still rank in top-N, just with different weights. Worst case is degraded ranking, not broken retrieval.
Hidden state corruption: Fall back to project defaults on checkpoint load.
Inference latency spike: If >5ms, skip and use baseline.

The SSM never blocks the critical path. It biases it.

title: “SSM Integration Research”

Hard Architectural Constraints

Research surfaced several non-negotiable design constraints. Ignoring any of these would produce a system that fails in production.

1. SSMs Cannot Do Associative Recall

Jamba (ICLR 2025) proved removing attention from SSM hybrids drops retrieval accuracy to 0%. This is a hard ceiling. SSMs compress temporal patterns but cannot look up specific facts from their state.

Constraint: The SSM is a prediction/bias layer only. SQLite + FTS5 + vector search handle all retrieval. The SSM tells the retrieval system where to look, not what to find.

2. S4 Outperforms Mamba on Graph Data

GraphSSM (NeurIPS 2024) found that selective Mamba underperforms non-selective S4 on graph-structured inputs. The selectivity mechanism doesn’t help when processing topology.

Constraint: Graph traversal path scoring should use S4 (or a non-selective variant), not Mamba. Sequential interaction modeling uses Mamba. Different SSM variants for different tasks.

3. nDCG Is Anti-Correlated with Online Reward

KDD 2024 proved nDCG shows -0.91 correlation with actual online reward when aggregated across sessions. The normalization introduces inconsistency.

Constraint: Use unnormalized DCG for model selection. Report nDCG for literature comparability only. Never use nDCG to decide whether the SSM is working.

4. LoRA on SSM Parameters Doesn’t Work

LoRA on A, B, C, Delta parameters achieves 76.9 GLUE (near-random). LoRA on linear projections achieves 87.0 (near full fine-tuning at 89.4).

Constraint: Per-user adapters target projection matrices only. SSM state matrices are shared across all users.

5. Memories Must Be Injected Chronologically

GenTKG proved temporal ascending order is the only effective fact arrangement for temporal KG reasoning. Importance-ordered injection destroys the temporal signal the SSM needs.

Constraint: Memory injection order is chronological within each relevance tier, not sorted by importance score.

6. Passive Recall ≠ Agentic Performance

MemoryArena (2026): models scoring near-perfect on passive recall tests dropped to 40-60% on agentic tasks.

Constraint: Evaluation must include end-to-end agent task outcomes, not just retrieval accuracy.

title: “SSM Integration Research”

Validation Contracts

Nothing ships without passing these tests. Tests are written before any production code. The SSM must prove it works on synthetic data with known ground truth before touching the real pipeline.

Phase 0: Synthetic Data Validation

Before any integration, validate the SSM in isolation using synthetic data with planted ground truth patterns.

Canary Pattern Tests

Seven patterns planted in synthetic interaction sequences. The SSM must find all seven to pass.

#	Pattern	Description	Pass Criteria
1	Temporal cycle	Entity X relevant every Monday	Predicted relevance for X on Monday > Thursday by 2x
2	Recency decay	Memories accessed recently score higher	Learned decay rate within 20% of planted rate
3	Entity selectivity	Memory Y only relevant when focal entity is Z	Relevance for Y given Z > relevance for Y given W by 3x
4	Supersession chain	Memory A superseded by B superseded by C	Relevance: C > B > A at all test points
5	Burst pattern	Cluster of accesses predicts near-future relevance	Recall@10 for post-burst queries > 0.8
6	Cross-entity dependency	When entity P active, entity Q’s memories become relevant	Co-activation detected with >0.7 correlation
7	Combined multi-signal	Temporal cycle + entity selectivity + recency	All three signals contribute (ablation: removing any one drops accuracy >10%)

SNR Degradation Curve

Test at signal-to-noise ratios: 30dB, 20dB, 10dB, 0dB, -10dB. Plot accuracy vs SNR. The model must:

Maintain >90% canary detection at 20dB
Maintain >70% canary detection at 10dB
Degrade gracefully (no cliff-edge failure)
The noise floor (where detection drops below 50%) must be documented

Memorization vs Learning Test (SynTSBench Dual-Error)

For each synthetic sequence, measure:

MSE_Obs: predictions vs noisy observations
MSE_True: predictions vs clean ground truth signal

If MSE_True improves while MSE_Obs stays flat, the model is learning temporal patterns. If both move together, it’s memorizing noise. The model must show MSE_True < MSE_Obs on all canary patterns.

TSTR Protocol (Train Synthetic, Test Real)

Once real session data exists:

Train on synthetic data only, test on real data
Train on real data only, test on real data (baseline)
Train on synthetic + real, test on real data

Pass criteria: (1) must achieve >60% of (2)‘s performance. (3) must exceed (2).

Phase 1: Offline Evaluation

Once the SSM is validated on synthetic data, evaluate on historical session data.

Evaluation Protocol

Primary: Global Temporal Split (GTS) with Successive targets. Cutoff at 90th percentile timestamp. All sessions before cutoff are training data. Each post-cutoff interaction is a test case.
Secondary: Leave-One-Out for literature comparability only.
Statistical significance: Paired bootstrap with BCa confidence intervals, minimum 5 random seeds, sign-flip permutation test. Claim significance only if BCa lower bound > 0 AND permutation p < 0.05.

Metrics Suite

Metric	Purpose	Target
HR@5	Did correct memory appear in top-5?	> baseline + 5%
HR@10	Did correct memory appear in top-10?	> baseline + 3%
MRR@10	Position of first correct memory	> baseline + 0.05
DCG@10 (unnormalized)	Model selection metric	> baseline
nDCG@10	Literature comparability only	Report, don’t optimize
Precision@K	Fraction of injected memories that were relevant	> 0.6
Recall@K	Fraction of relevant memories that were injected	> 0.5
Latency p50/p99	Inference time per prediction	p50 < 1ms, p99 < 5ms

Required Ablations

Ablation	What It Tests
No temporal features	Is the SSM using temporal dynamics?
No content features	Is the SSM using content similarity?
No selectivity (A, B, C fixed)	Is Mamba’s selection mechanism helping?
State dim N: 4, 16, 32, 64	Optimal state size for our data
Layer count: 1, 2, 3, 4	Optimal depth
Sequence length: <10, 10-50, 50-200 sessions	Length sensitivity
Temporal feature only (remove content)	Can SSM predict from time alone?
M2Rec FFT validation	Does model respond to daily/weekly frequencies?

Cold-Start Stratified Evaluation

Stratum	Sessions	Target
Ice cold	1-3 sessions	Match baseline (no regression)
Warming	4-10 sessions	Beat baseline by >2% HR@10
Warm	11-50 sessions	Beat baseline by >5% HR@10
Hot	50+ sessions	Beat baseline by >8% HR@10

Document the crossover point: at what session count does the SSM reliably beat the baseline? This is the “cold start disappears” threshold from VISION.md.

Phase 2: Shadow Mode Evaluation

Run the SSM alongside the current system using Signet’s existing shadowMode infrastructure.

SSM predicts top-K memories for each session
Current system injects its own top-K
Compare: do SSM’s top-K receive better agent feedback than current system’s top-K?
Use SNIPS (self-normalized inverse propensity scoring) for counterfactual estimation of what would happen under SSM’s policy

Pass criteria: SSM’s predicted top-K must receive statistically significantly better agent feedback than baseline’s top-K (p < 0.05) over a minimum of 50 sessions.

Phase 3: Online A/B Evaluation

Enable SSM for real memory injection on alternating sessions.

Track: agent feedback signals, task completion, memory hit rate, latency, contradiction rate, staleness distribution
Minimum 100 sessions per arm
Primary metric: agent_relevance_score mean across sessions

Pass criteria: SSM arm must show statistically significant improvement in agent_relevance_score with no regression in latency or contradiction rate.

title: “SSM Integration Research”

Synthetic Data Strategy

Why Synthetic First

We can’t wait for thousands of real sessions to validate the SSM approach. Synthetic data with planted ground truth lets us:

Prove the model learns temporal patterns (canary tests)
Measure the model’s noise floor (SNR curves)
Validate the training pipeline end-to-end
Establish baseline metrics before touching real data

Data Generation Architecture

Generate synthetic interaction sequences that model agent sessions with known temporal patterns.

SyntheticGenerator
  |
  +-- EntityGraph (synthetic knowledge graph)
  |     - 50-200 entities with typed relationships
  |     - Aspects and attributes per entity
  |     - Known dependency chains
  |
  +-- TemporalPatternMixer
  |     - Daily cycles (tod sin/cos)
  |     - Weekly cycles (dow sin/cos)
  |     - Burst patterns (Poisson process)
  |     - Decay curves (exponential + power law)
  |     - Seasonal drift
  |
  +-- SessionSimulator
  |     - 5-50 turns per session
  |     - 1-10 sessions per day
  |     - Focal entity selection from pattern mixer
  |     - Memory candidate generation (relevant + distractors)
  |     - Ground truth labels (which memories are truly relevant)
  |
  +-- NoiseInjector
        - Gaussian noise at configurable SNR
        - Label noise (flip ground truth with probability p)
        - Feature noise (corrupt individual features)
        - Missing data (randomly mask features)

Data Budget

Phase	Synthetic	Real	Total	Purpose
Phase 0	50K sequences	0	50K	Canary validation
Phase 1	50K	~500 sessions	~55K	Bootstrap
Phase 2	25K	~2K sessions	~27K	Mixed training
Phase 3	10K (augmentation)	~5K+ sessions	~15K+	Real-dominant

Hard Negative Generation

Critical for teaching the model to distinguish relevant from nearly-relevant:

Entity-neighbor negatives: Memories from entities adjacent to the focal entity in the graph (structurally similar, semantically wrong)
Temporal-neighbor negatives: Memories from the correct entity but wrong time window (right topic, stale information)
Semantic-neighbor negatives: Memories with high embedding similarity but different entity context
Superseded negatives: Old versions of facts that have been updated

Curriculum Schedule

Start easy, get harder. Based on Pythia findings that curriculum effects matter most for small models.

Stage	Sequence Length	SNR	Hard Negatives	Duration
1	10-20 turns	30dB	0%	20% of training
2	20-50 turns	20dB	25%	30% of training
3	50-200 turns	10dB	50%	30% of training
4	Full range	5dB	75%	20% of training

title: “SSM Integration Research”

Continual Learning Architecture

Per-User Adaptation

Shared base model (Mamba-3, 5-10M params) + per-user LoRA adapters on linear projections (150KB-1.5MB per user).

Base Model (shared, immutable during inference)
  |
  +-- Projection layers with LoRA adapters (per-user)
  |     - in_proj: LoRA rank 8-16
  |     - out_proj: LoRA rank 8-16
  |     - dt_proj: LoRA rank 4-8
  |     - ~150KB per user at rank 8
  |     - ~1.5MB per user at rank 16
  |
  +-- SSM state matrices A, B, C (shared, NOT adapted)
  |     - Lyapunov stability guaranteed
  |     - Perturbations decay, not amplify
  |
  +-- Hidden state (per-session, persisted across turns)
        - Checkpoint at session end
        - Restore at session start

Session-Level TTT (Test-Time Training)

TTT4Rec shows meaningful adaptation from 5-10 interactions. Per-session:

At session start, initialize LoRA deltas from stored adapter
During session, accumulate self-supervised loss on predictions vs actual memory usage
At session end, apply one gradient step to LoRA adapter
Store updated adapter weights

Computational cost: one backward pass per session (~3.4x forward cost). Amortized across a full session, this is negligible.

Daemon Lifecycle Mapping (SleepGate-Inspired)

Biological Phase	Daemon Phase	SSM Activity
Wake	Active sessions	Forward inference, accumulate gradients
NREM (consolidation)	Idle period (no active sessions)	Apply gradient updates, compress replay buffer
REM (exploration)	Scheduled maintenance	Synthetic data exploration, canary re-validation

Trigger consolidation based on parameter-space drift (FOREVER metric), not clock time. The SSM retrains when its predictions have drifted from observed outcomes, not on a fixed schedule.

Drift Detection

Trinity-Controller ADWIN: fuse volatility, adaptive sensitivity, and accuracy EMA into a unified detector. When drift exceeds threshold:

Increase LoRA rank temporarily (more adaptation capacity)
Trigger NREM consolidation pass
If drift persists after consolidation, trigger full fine-tune cycle

Federated Learning Path

Per-user LoRA deltas (150KB-1.5MB) are the communication payload. With differential privacy applied at the client before transmission:

Clip LoRA gradients to norm bound
Add calibrated Gaussian noise
Transmit delta to aggregation server
Server averages deltas into base model update
Ship updated base model with next Signet release

title: “SSM Integration Research”

Training Data Inventory

Existing tables that feed SSM training (from codebase exploration):

Table	Rows/Session	Temporal Resolution	Ground Truth Signal
predictor_training_pairs	50-200	Per-session	combined_label [-1, 1]
session_memories	200-1000	Per-session (fts_hit_count per-prompt)	agent_relevance_score, was_injected
session_scores	1	Per-session	continuity score + confidence
predictor_comparisons	1	Per-session	predictor_won, NDCG, margin
session_checkpoints	5-20	Periodic within session	focal_entities, active_aspects
memory_history	5-50	Per-event	event type, old/new content
memory_entity_mentions	Per-memory	Per-extraction	entity appearance timeline

Critical Data Gaps

These gaps must be addressed before real-data training:

No turn-by-turn ranking snapshots: Add per-prompt telemetry recording which memories were considered and their rank order. New table: prompt_memory_rankings(session_key, prompt_index, memory_id, rank, score, was_injected).
No query embeddings persisted: Store the query embedding for each prompt. Privacy concern mitigated by storing only the embedding vector (768 floats), not the raw text.
No explicit negative labels: Derive from was_injected=0 + fts_hit_count=0 + agent_relevance_score is NULL. Also generate hard negatives during training (entity-neighbor, temporal-neighbor).
No inter-session memory chains: Add a memory_session_links table tracking which memories were relevant across consecutive sessions for the same project.

Implementation Strategy

Training Pipeline

Phase 0: Build synthetic data generator. Run canary validation tests. Prove the SSM learns temporal patterns in isolation. No production code changes.
Phase 1: Pre-train base model on synthetic data. Validate on synthetic held-out set. Pass all canary tests and SNR curves.
Phase 2: Add turn-level telemetry to daemon (new migration for prompt_memory_rankings). Collect real data for 2-4 weeks.
Phase 3: Fine-tune on real data (TSTR protocol). Pass offline evaluation metrics on GTS split.
Phase 4: Shadow mode deployment. SSM predicts alongside current system. Compare feedback signals.
Phase 5: Online deployment with A/B evaluation.
Phase 6: Federated base model from opt-in community signals.

Sidecar Architecture

Extend the existing predictor/ Rust crate:

predictor/
  src/
    main.rs          -- JSON-RPC server (already exists)
    model.rs         -- Replace CrossAttentionScorer with SSM
    ssm/
      mamba.rs       -- Mamba-3 forward pass (interaction modeling)
      s4.rs          -- S4 forward pass (graph traversal scoring)
      state.rs       -- Hidden state persistence + checkpoint
      adapter.rs     -- Per-user LoRA adapter management
    training.rs      -- Online fine-tuning loop
    synthetic.rs     -- Synthetic data generator (for validation)
    protocol.rs      -- Extended input/output types
    canary.rs        -- Built-in canary pattern tests

The existing RPC interface between daemon and sidecar remains. The sidecar includes built-in canary tests that run on startup to verify the model hasn’t degraded.

title: “SSM Integration Research”

Expected Impact

Measurable Targets (Must Hit to Proceed Between Phases)

Gate	Metric	Target	Measured On
Phase 0 -> 1	Canary detection rate	7/7 patterns at 20dB SNR	Synthetic data
Phase 1 -> 2	TSTR performance ratio	>60% of real-only baseline	Synthetic train, real test
Phase 2 -> 3	HR@10 improvement	>3% over baseline (p < 0.05, 5 seeds)	GTS split on real data
Phase 3 -> 4	Shadow mode feedback	SSM top-K feedback > baseline top-K (p < 0.05)	50+ shadow sessions
Phase 4 -> 5	Online agent_relevance	Significant improvement, no latency regression	100+ A/B sessions

Per-Session (Post Phase 5)

Metric	Current	With SSM	How We’ll Know
LLM calls per session	8-10	3-4	Telemetry counter
Per-turn latency	50-100ms	30-50ms	p50/p99 latency histogram
Checkpoint size	~2KB	~256 bytes	Measured per checkpoint
Cold-start crossover	Never (no learning)	~10 sessions	Stratified evaluation
Memory injection precision	Unknown	>0.6	agent_relevance on injected memories

Cost

Binary size: ~2MB (1.58-bit quantization) to ~15MB (int8)
Runtime memory: ~20-30MB (model + state + adapter)
CPU overhead: <1ms per turn (SSM inference)
Per-user adapter: 150KB-1.5MB
Training: session-end gradient step + periodic consolidation

title: “SSM Integration Research”

Risk Assessment

Technical Risks

SSM training signal quality: Continuity scorer labels are noisy. Mitigation: Validate labels against behavioral signals (FTS overlap, session gaps). Use confidence-weighted loss. Start with highest-confidence subsets. Synthetic canary tests prove the model works before real data.

Feedback loop instability: SSM could converge to local optima, reinforcing mediocre paths. Mitigation: Thompson sampling (noise in path scores). Periodic exploration resets on low-confidence edges. FOREVER-style drift detection triggers re-exploration when predictions diverge from observations.

State representation explosion: Large hidden states are expensive to serialize. Mitigation: Mamba-3’s complex-valued state achieves same quality with half the dimensions. 1.58-bit quantization fits 10M params in 2MB.

Rust SSM maturity: Rust Mamba implementations are early-stage. Mitigation: Start with S4D (simplest, proven at 200K params in JAX, straightforward to port). Upgrade to Mamba-3 once Rust tooling matures. web-rwkv (RWKV-7) is the most production-ready Rust alternative. mamba.rs and oxidizr/blazr provide Mamba-specific paths.

S4 vs Mamba for graphs: GraphSSM shows S4 outperforms Mamba on graph data. Mitigation: Use S4 for traversal path scoring, Mamba for sequential interaction modeling. Two small models, not one.

Scope Risks

Over-engineering: Multiple SSM integration points is ambitious. Mitigation: Phase gates with measurable targets. Nothing proceeds without passing validation contracts. Each phase is independently deployable/revertable.

Data gap: No turn-by-turn ranking data exists today. Mitigation: Phase 2 adds telemetry before fine-tuning. Synthetic data validates the approach while real data accumulates.

Evaluation trap: Optimizing offline metrics that don’t predict online improvement. Mitigation: Use unnormalized DCG (0.97 online correlation), not nDCG (-0.91 correlation). Shadow mode and A/B testing before production deployment.

title: “SSM Integration Research”

Proof of Concept Results (2026-03-20)

Standalone benchmark at packages/predictor/bench/ssm_proof_of_concept.py. No production code changes. Selective SSM (Mamba-style, 50K params) vs MLP baseline (3.2K params) vs production heuristic on synthetic data matching the exact 17-dim feature layout from protocol.rs.

Head-to-Head (hand-crafted synthetic, 2000 sequences)

Metric	Heuristic	MLP	SSM	SSM vs Heuristic
HR@5	0.545	0.574	0.724	+32.8%
MRR@5	0.312	0.323	0.475	+52.2%
DCG@5	0.754	0.784	1.009	+33.7%

Canary Pattern Detection (SSM)

Pattern	HR@5	Status
recency_bias	0.920	PASS
importance_threshold	0.460	FAIL
access_frequency	0.770	PASS
temporal_coherence	0.910	PASS
entity_clustering	0.930	PASS
supersession_filter	0.720	PASS
graph_traversal_priority	0.550	PASS

The importance_threshold failure is expected — SSMs learn temporal and relational patterns, not pointwise scalar thresholds. This validates the multi-head architecture: let a simple threshold head handle importance while the SSM handles what it’s good at.

LLM-Generated Synthetic Data

Used generate_scenarios.py to create training data via local LLM (gpt-oss:20b, then qwen3:8b). The LLM generates behavioral scenarios (narratives with metadata), converted deterministically to 17-dim feature vectors.

Comparison on held-out test set (neutral ground, different seed):

Training Source	Sequences	HR@5	Gen Gap	Canary
hand-crafted	2000	0.587	0.094	7/7
LLM-generated	29	0.556	-0.036	6/7
combined	2029	0.577	0.150	6/7
heuristic	—	0.545	—	—

Key finding: 29 LLM-generated scenarios produced better generalization than 2000 hand-crafted sequences. The negative gen gap means the model performed better on unseen data than on training data — zero memorization. LLM scenarios encode behavioral diversity that numpy distributions cannot.

The combined model underperformed because hand-crafted data dominates by volume (2000 vs 29) and reintroduces memorization. With 200+ LLM scenarios, combined should win on both metrics.

Inference Latency

Model	p50	p95	p99
SSM (50K)	3.19ms	4.42ms	5.38ms
MLP (3.2K)	0.17ms	2.35ms	2.35ms

P95 at 4.42ms on GPU (sequential scan). With parallel scan implementation, expect sub-1ms. Production Rust sidecar with CUDA will be faster still.

Phase 0 Gate Determination

All four gates passed:

HR@K >= 0.60: 0.724 (PASS)
DCG improvement >= 10%: 33.7% (PASS)
P95 latency <= 5ms: 4.42ms (PASS)
Canary pass >= 5/7: 6/7 (PASS)

Verdict: SSM is viable for Signet memory prediction.

Analysis and Immediate Improvements

Research into the PoC results surfaced several actionable findings:

Why importance_threshold failed (0.46 HR). The SSM processes candidates as a sequence through Conv1d and selective state transitions — all machinery for modeling relationships between positions. A pointwise threshold (“is feature[1] > 0.7?”) doesn’t benefit from any of this. MambaTab (PMC 2024) confirms SSMs process across tokens, not across features within a token. Fix: add a parallel pointwise MLP branch (~2K params) before the SSM layers. The pointwise branch handles threshold decisions; the SSM handles temporal and relational patterns. Both concatenate into the readout heads.

The negative gen gap (-3.6%) is expected and healthy. DR4SR (KDD 2024 Best Student Paper) showed quality and diversity matter more than quantity — their regenerated datasets with fewer but more diverse interactions improved performance 5-43% across 5 architectures. The 15% gap on hand-crafted data is the real problem: 7 rigid patterns with fixed seeds create exploitable regularity. The combined dataset (gen gap 0.150) suffered because hand-crafted data dominates by volume and reintroduces memorization.

50K params is right-sized for 500+ scenarios, oversized for 29. At 29 scenarios (580 observations), the 86:1 param-to-observation ratio is severely overparameterized. MambaTab defaults to 13-15K params for tabular tasks. Mamba4Rec uses embed_dim=64, state_expansion=32 on 132K-999K interactions. Immediate fix: increase weight_decay from 1e-4 to 0.01 (100x, per Mamba small-dataset recommendations). For 500+ scenarios, 50K params is in the sweet spot.

Switch BCE to softmax cross-entropy (listwise loss). The current binary_cross_entropy_with_logits treats each candidate independently — a pointwise loss that ignores ranking structure. Bruch et al. (SIGIR 2019) proved softmax CE is a convex bound on both MRR and nDCG. With only 20 candidates, full softmax is trivial. Expected: significant improvement on MRR and DCG metrics.

Fix the 52% relevance ratio. Two approaches: (1) rejection sampling in scenario_to_training_pair — skip scenarios where n_relevant / n_candidates > 0.45; (2) hard negative injection — for each relevant candidate, programmatically add a near-miss candidate with similar entity_slot/recency but labeled irrelevant.

Priority order for next PoC iteration:

Softmax CE loss (30 min, highest impact)
Rejection sampling + hard negatives in generator (1 hr)
Increase weight_decay to 0.01 (1 line)
Scale LLM data to 500+ scenarios (overnight run)
Add parallel pointwise branch (fixes importance_threshold)
Curriculum training schedule (easy -> hard patterns)
Multi-task auxiliary losses on significance/retention heads

title: “SSM Integration Research”

Desire Paths Integration Map

The SSM research converges with the desire paths epic at Phase 4 (Path Learning). Here is the precise integration point and build sequence.

Current Progress (as of 2026-03-20)

DP Phase	Stories	Status
P1: Foundation	DP-1 through DP-4	COMPLETE
P2: Topology	DP-5 (Leiden)	COMPLETE
P3: Graph-Native	DP-6, 6a, 7	COMPLETE (DP-6 FTS5 entity resolution remaining)
P4: Path Learning	DP-8, 9, 10, 11	DP-8 COMPLETE, rest NOT STARTED
P5: Emergence	DP-12 through 15	NOT STARTED

Where the SSM Slots In

DP-9 (Path Feedback Propagation) generates the SSM’s training signal. Currently, feedback is memory-level (“was this memory useful?”). DP-9 upgrades this to path-level (“was this traversal path useful?”). Paths are sequences of hops — exactly the data structure SSMs process. Without DP-9, the SSM trains on memory-level labels. With DP-9, it trains on path-level labels where its temporal modeling excels.

DP-10 (Path Scoring) is the primary integration point. The spec says “the predictor evolves from a memory ranker to a path scorer.” The SSM replaces the CrossAttentionScorer for this task. Path-level features defined in DP-10 (hop count, min edge confidence, average aspect weight, community boundary crossing) extend the current 17-dim feature vector with graph-structural signals. The SSM processes the sequence of hops in a path and scores the whole trajectory.

DP-11 (Temporal Reinforcement) is the SSM’s natural strength. “Which paths matter at which times” is a temporal prediction problem. The Mamba-style input-dependent gating decides which temporal signals to keep in state and which to forget. The PoC showed 0.920 HR@5 on recency and 0.910 on temporal coherence — this is what SSMs do.

DP-12 (Explorer Bees) in Phase 5 becomes informed exploration with the SSM. Instead of random speculative traversals, the SSM predicts which unfamiliar paths are likely to yield useful results based on learned temporal and structural patterns.

Proposed Build Sequence

Current: DP-6 remaining (FTS5 entity resolution)
    |
    v
DP-9: Path feedback propagation
    |   - Tags injected context with traversal path provenance
    |   - Positive/negative feedback propagates along path edges
    |   - Stores path feedback history for SSM training data
    |
    v
SSM Phase 0: Synthetic validation (COMPLETE -- see PoC results above)
    |
    v
SSM Phase 1: Path feature extension
    |   - Extend 17-dim vector with path-level features
    |   - Add prompt_memory_rankings telemetry table
    |   - Pre-train base model on synthetic + LLM-generated data
    |
    v
DP-10: Path scoring with SSM
    |   - Replace CrossAttentionScorer with SSM in Rust sidecar
    |   - S4 variant for graph path scoring (per GraphSSM constraint)
    |   - Mamba variant for temporal interaction modeling
    |   - Shadow mode evaluation against current system
    |
    v
DP-11: Temporal reinforcement
    |   - SSM learns temporal patterns from path feedback
    |   - Pre-warming: predict paths needed before query arrives
    |   - Per-user LoRA adapters for personalization
    |
    v
DP-12+: Emergence (SSM-guided exploration)

Hard Constraint: S4D for Paths, Mamba for Sequences

GraphSSM (NeurIPS 2024) tested S4, S5, and S6 (Mamba) as sequence backbones within a unified GNN+SSM framework. Results:

Reddit: S4 49.21% >> S5 44.75% >> Mamba 43.11%
DBLP-10: S4 76.80% > S5 75.19% > Mamba 74.09%

The paper states: “the selective mechanism may not be a good fit for graph data.” Mamba’s input-dependent state transitions are powerful for language (filter irrelevant tokens) but counterproductive for graph topology where every node’s position matters for preserving structure. S4’s fixed state transitions act as stable structural filters.

Additionally, GraphSSM uses Laplacian regularization — a quadratic term that compresses both feature dynamics and topological structure simultaneously. This is graph-aware denoising via the graph Laplacian, not just sequential ordering.

DyGMamba provides direct precedent for dual-SSM architecture. It uses a node-level SSM (neighbor interaction sequences) and a time-level SSM (edge-specific temporal patterns) as two separate Mamba blocks. The time-level SSM output dynamically selects relevant information from the node-level SSM via softmax attention-like weights.

Other precedents: I2I-Mamba (dual-domain SSM with cross-domain state coupling), Coupled Mamba (multi-modal fusion where each modality has its own SSM chain with inter-modal hidden state transitions), and Mamba-3 MIMO (single SSM with multiple input/output channels).

Path Feature Extension (17-dim -> 24-dim)

Based on DyGMamba’s edge encoding and NeuralWalker’s walk embedding:

Dim	Feature	Rationale
17	`hop_count` (normalized /10)	Path length, strongest structural signal
18	`min_edge_confidence`	Weakest-link bottleneck quality
19	`avg_aspect_weight`	Central vs peripheral path
20	`community_boundary_crossings` (norm)	Cross-domain reach
21	`path_feedback_score`	Historical path success rate
22	`log(dependency_strength_product)`	Strong vs weak conceptual links
23	`focal_distance` (normalized)	Distance from seed entity (0=focal)

NeuralWalker’s walk embedding formula: h_W[i] = h_V(w_i) + proj_edge(h_E(w_i, w_{i+1})) + proj_pe(h_pe[i]) — node embedding + projected edge embedding + positional encoding. For our case, the 24-dim feature vector already encodes per-hop signals; the S4D block processes the sequence of hops.

DyGMamba encodes edge features via MLP projection to hidden dim, concatenated with node features and temporal encoding. Our path-level features (dims 17-23) concatenate directly with the existing per-memory features and need no separate MLP if pre-normalized.

Parameter Sizing

S4D parameter formula per layer:

A matrix (diagonal complex): 2N real params
B, C matrices: 2HN each (complex)
D skip connection: H params
Input/output projections: feature_dim * H + H * output_dim

Reference: S4D at H=256, N=64, 6 layers achieves 88% on sequential CIFAR (length-1024). Our paths are 3-10 hops with 24-dim features — far simpler.

Config	H	N	Layers	Params	Use Case
Tiny	32	16	2	~15K	Minimum viable path scorer
Small	64	32	3	~60K	Sweet spot for 24-dim features
Medium	64	64	4	~120K	Complex inter-hop patterns

Recommendation: Start with a single S4D block at Small config (60K params). Add the Mamba behavioral block only if ablations show temporal features are underweighted by S4D alone. Dual architecture total budget: ~100K params. This is well under FlowState (9.1M) because we have a fixed graph schema (entity -> aspect -> attribute -> dependency) with short paths and pre-computed features.

Sidecar Architecture (Refined)

predictor/
  src/
    ssm/
      s4d.rs         -- S4D diagonal forward pass (path scoring)
      mamba.rs       -- Mamba selective forward pass (behavioral)
      combiner.rs    -- Gated addition of SSM outputs
      state.rs       -- Hidden state persistence + checkpoint
      adapter.rs     -- Per-user LoRA adapter management

S4D’s diagonal form requires only element-wise complex multiplication for the recurrence — no CUDA kernels needed for paths this short. Implementable in pure Rust using the existing autograd infrastructure.

Integration with Existing Specs

Spec	SSM Integration Point
`predictive-memory-scorer`	SSM replaces CrossAttentionScorer entirely
`desire-paths-epic` DP-9	Provides path-level training signal
`desire-paths-epic` DP-10	Primary integration: SSM scores paths
`desire-paths-epic` DP-11	SSM temporal modeling
`desire-paths-epic` DP-12	SSM guides exploration
`knowledge-architecture-schema`	SSM consumes KA structural features
`session-continuity-protocol`	SSM state replaces raw checkpoint arrays
`procedural-memory-plan`	SSM models skill decay + co-usage
`predictor-agent-feedback`	MCP feedback tool feeds SSM training

What Doesn’t Change

The SSM is additive to the existing architecture. It does not replace:

The knowledge graph (entities, aspects, attributes, dependencies)
FTS5 keyword search or vector similarity search
The traversal algorithm in graph-traversal.ts
The memory pipeline extraction/decision stages
SQLite storage or the daemon HTTP API

The graph provides structure. The SSM provides learning. They compose.

title: “SSM Integration Research”

Open Questions

Complex-valued state for cyclical patterns: Mamba-3’s complex state models oscillations naturally. Worth prototyping for daily/weekly usage patterns. How much does this help vs real-valued state? Testable: plant daily/weekly canary patterns, compare complex vs real state detection rates.
MIMO for multi-task prediction: Can MIMO SSMs predict relevance, retention, and significance from a single forward pass without task interference? Testable: compare multi-head MIMO vs separate models on each task.
CausalMamba for retention: Can differentiable causal graph learning discover why memories become relevant, enabling causal-ancestor-aware decay? Memories that cause other memories to be retrieved should decay slower. Novel research direction.
SSM state as identity embedding: Could the persistent hidden state serve as a compact representation of agent personality/preferences, complementing SOUL.md? An agent’s “feel” encoded in 256 floats.
Bi-temporal data model for contradictions: Zep/Graphiti uses 4 timestamps per edge (valid_from, valid_to, created_at, expired_at). Should we adopt this for entity_dependencies instead of binary supersession? The SSM could learn temporal validity windows.
DG-Mamba PRI regularization for entity bloat: Learned pruning of inactive graph regions. Could replace our manual entity count thresholds (current: 43,520 entities, target <1,000).
Publishable research: Five novel gaps identified where Signet could contribute original work to the SSM literature: SSM contradiction detection, SSM belief revision, SSM identity coherence, privacy-preserving SSM memory, federated SSM weight aggregation.
Dual-SSM combiner design: The S4 (paths) + Mamba (sequences) dual architecture needs a combiner. Options: learned weighted sum, gated mixture, cross-attention between the two state vectors. What’s the simplest approach that doesn’t introduce a third model? Can the two SSM outputs simply concatenate into the readout head input?
Path feature dimensionality: DP-10 defines path features (hop count, min confidence, avg aspect weight, community crossings, historical feedback). How many dimensions does this add? The current 17-dim per-memory vector needs a per-path extension. Should paths be encoded as sequences of per-hop features (variable length) or as a fixed-length path summary vector?
LLM-generated data scaling: The PoC showed 29 LLM scenarios outperform 2000 hand-crafted on generalization. What’s the curve? Does 200 LLM scenarios beat 200 hand-crafted on raw metrics too? Is there a point of diminishing returns? The relevance ratio from qwen3:8b was 52% (target 15-40%) — tightening this should improve hard negative coverage.

title: “SSM Integration Research”

References

See companion documents for full citations:

Round 1 (foundational research):

SSM-LITERATURE-REVIEW.md — 26 papers: HiPPO, S4, H3, Hyena, Mamba 1/2/3, TTT, xLSTM, FlowState, SleepGate, memory compression formalism, sequential recommendation
ssm-implementations-survey.md — 20+ repositories: mamba.rs, oxidizr/blazr, web-rwkv, llama.cpp Mamba, Candle, production deployments (Jamba, Nemotron-H, Bamba, Falcon Mamba)

Round 2 (validation, depth, novel applications):

SSM-NOVEL-APPLICATIONS.md — 35+ papers: CausalMamba, Drama (7M world model), 1.58-bit Mamba, MPS-SSM (minimal predictive set), FOREVER (drift-based retraining), Jamba retrieval constraint, KambaAD anomaly detection
SSM-GRAPH-INTERSECTION.md — 30+ papers: DyGMamba/DyG-Mamba, NeuralWalker, HeteGraph-Mamba, GraphSSM (S4 > Mamba on graphs), GenTKG (chronological ordering), HiSS (hierarchical SSM), Zep/Graphiti bi-temporal model
SSM-CONTINUAL-LEARNING-DEEP-DIVE.md — TTT/TTT4Rec, Mamba-CL (null-space projection), Inf-SSM (Grassmannian regularization), LoRA effectiveness on SSMs (projections not matrices), SleepGate (99.5% on proactive interference), WSCL (wake/NREM/REM), FedSSM, FOREVER drift detection
SYNTHETIC-DATA-GENERATION.md — DR4SR (KDD 2024 best paper), RecSim NG, SynTSBench, curriculum learning (Pythia), EntiGraph entity expansion, slide-window augmentation, FENRec hard negatives, TSTR protocol

Proof of concept and tooling:

packages/predictor/bench/ssm_proof_of_concept.py — standalone benchmark (SSM vs MLP vs heuristic, canary suite, SNR curves, dual-error test, latency benchmark, Phase 0 gates). Supports --data for LLM-generated JSONL, --compare for side-by-side.
packages/predictor/bench/generate_scenarios.py — LLM-based synthetic data generator. 14 pattern types, curriculum weighting, schema validation. Outputs JSONL matching 17-dim feature layout.

Round 3 (path scoring, PoC analysis, integration mapping):

GraphSSM S4>Mamba on graphs, NeuralWalker walk embeddings, DyGMamba dual-SSM precedent, GrassNet spectral filtering, Walk&Retrieve zero-shot RAG, path feature extension (17->24 dim), parameter sizing
PoC analysis: importance_threshold structural failure, DR4SR diversity findings, softmax CE loss, MambaTab parameter scaling, hard negative mining, curriculum learning for small SSMs

Key papers for this synthesis:

Gu & Dao, “Mamba” (2023) — selective state spaces
Gu & Dao, “Mamba-2” (2024) — state space duality
Lahoti et al., “Mamba-3” (2026) — inference-first design
IBM FlowState (2025) — 9.1M param SSM beats 200M+ models
Xie, “SleepGate” (2026) — sleep-inspired memory consolidation
Bhat, “Memory Compression in Selective SSMs” (2024) — theoretical bounds
Gu, “Tradeoffs of SSMs and Transformers” (2025) — brain vs database
GraphSSM (NeurIPS 2024) — S4 outperforms Mamba on graph data
Jamba (ICLR 2025) — attention required for associative recall (0% without)
KDD 2024 — nDCG anti-correlated with online reward (-0.91)
MemoryArena (2026) — passive recall ≠ agentic performance
DyG-Mamba — Ebbinghaus forgetting as learned SSM decay
NeuralWalker — bidirectional Mamba SOTA on graph walks
SS4Rec / DyGMamba — sequential recommendation + temporal KG reasoning
TTT4Rec — meaningful adaptation from 5-10 interactions
DR4SR (KDD 2024 best paper) — synthetic data regeneration
“When +1% Is Not Enough” — BCa bootstrap significance protocol
NeuralWalker (ICLR 2025) — bidirectional Mamba SOTA on graph walks, walk embedding formula
GrassNet (2024) — SSMs as GNN spectral filters
Walk&Retrieve (IR-RAG 2025) — walk-based graph traversal for zero-shot RAG
Coupled Mamba (2024) — multi-modal SSM fusion pattern
I2I-Mamba (2024) — dual-domain SSM with cross-domain state coupling
Graph Mamba Networks (KDD 2024) — bidirectional Mamba scanning for graphs
MambaTab (PMC 2024) — 13-15K params for tabular tasks, SSM sequential bias
Mamba4Rec (2024) — embed_dim=64, state=32 for sequential recommendation
Bruch et al. (SIGIR 2019) — softmax CE is convex bound on MRR and nDCG
DR4SR (KDD 2024) — diversity matters more than quantity for synthetic data
Diff4Rec (2023) — curriculum-scheduled diffusion augmentation
Contrastive Curriculum Learning (CIKM 2021) — easy-to-hard ordering

State-Space Models for Signet: Research Synthesis

The Thesis

Current Pipeline: The Heuristic Map

Significance Gate (significance-gate.ts)

Extraction (extraction.ts)

Decision Engine (decision.ts)

Contradiction Detection (contradiction.ts)

Importance Scoring (hooks.ts:effectiveScore)

Graph Traversal (graph-traversal.ts)

Session Hook Flow (hooks.ts:handleUserPromptSubmit)

Predictive Scorer (Rust sidecar, predictor/)

Retention (retention-worker.ts)

Summarization (summary-worker.ts, summary-condensation.ts)

Continuity State (continuity-state.ts)

What the Literature Says

The Core Architecture: Mamba

The Brain-Database Analogy

Proven at Our Scale

Inference Performance

Rust Ecosystem

Key Insight: Prediction, Not Retrieval

The Integration Map

Where SSMs Replace Heuristics

Where SSMs Augment (Not Replace)

Alignment with Spec Pipeline

Proposed Architecture

The Signet Neural Backbone (SNB)

Input Features Per Interaction Event

Output Heads (Multi-Task)

Data Flow

The Continuity Loop

Graceful Degradation

Hard Architectural Constraints

1. SSMs Cannot Do Associative Recall

2. S4 Outperforms Mamba on Graph Data

3. nDCG Is Anti-Correlated with Online Reward

4. LoRA on SSM Parameters Doesn’t Work

5. Memories Must Be Injected Chronologically

6. Passive Recall ≠ Agentic Performance

Validation Contracts

Phase 0: Synthetic Data Validation

Canary Pattern Tests

SNR Degradation Curve

Memorization vs Learning Test (SynTSBench Dual-Error)

TSTR Protocol (Train Synthetic, Test Real)

Phase 1: Offline Evaluation

Evaluation Protocol

Metrics Suite

Required Ablations

Cold-Start Stratified Evaluation

Phase 2: Shadow Mode Evaluation

Phase 3: Online A/B Evaluation

Synthetic Data Strategy

Why Synthetic First

Data Generation Architecture

Data Budget

Hard Negative Generation

Curriculum Schedule

Continual Learning Architecture

Per-User Adaptation

Session-Level TTT (Test-Time Training)

Daemon Lifecycle Mapping (SleepGate-Inspired)

Drift Detection

Federated Learning Path

Training Data Inventory

Critical Data Gaps

Implementation Strategy

Training Pipeline

Sidecar Architecture

Expected Impact

Measurable Targets (Must Hit to Proceed Between Phases)

Per-Session (Post Phase 5)

Cost

Risk Assessment

Technical Risks

Scope Risks

Proof of Concept Results (2026-03-20)

Head-to-Head (hand-crafted synthetic, 2000 sequences)

Canary Pattern Detection (SSM)

LLM-Generated Synthetic Data

Significance Gate (`significance-gate.ts`)

Extraction (`extraction.ts`)

Decision Engine (`decision.ts`)

Contradiction Detection (`contradiction.ts`)

Importance Scoring (`hooks.ts:effectiveScore`)

Graph Traversal (`graph-traversal.ts`)

Session Hook Flow (`hooks.ts:handleUserPromptSubmit`)

Predictive Scorer (Rust sidecar, `predictor/`)

Retention (`retention-worker.ts`)

Summarization (`summary-worker.ts`, `summary-condensation.ts`)

Continuity State (`continuity-state.ts`)