Deep Memory Search (Agentic Escalation)
Spec metadata:
- ID: deep-memory-search
- Status: planning
- Hard depends on: desire-paths-epic, ssm-foundation-evaluation
- Registry: docs/specs/INDEX.md
1) Problem
Primary retrieval (graph traversal + hybrid vector/keyword) works well when the query maps to embedded content. But some queries fail because the connection is semantic, not lexical or geometric: the memory exists but its embedding is distant and keywords don’t overlap. LoCoMo experiments showed retrieval was the bottleneck in most failures.
Supermemory demonstrates that an LLM can read and compare memories directly, finding connections that embedding similarity misses. The tradeoff is cost and latency: LLM comparison is 10-100x slower. The proposed solution is to escalate to deep search only when primary retrieval confidence is low.
2) Goals
- Define a confidence threshold that triggers deep search escalation.
- Use an LLM to read, compare, and rank candidate memories directly.
- Keep deep search optional and off by default (opt-in via config).
- Bound latency and cost with token budgets and candidate limits.
- Feed deep search outcomes back to the scorer as training signal.
3) Proposed capability set
A) Confidence-based escalation trigger
After primary retrieval, compute confidence from top-score magnitude,
rank-1-to-rank-2 gap, and result count. Below a configurable threshold
(default: 0.3, exposed at memory.deepSearch.threshold in agent.yaml),
the query escalates to deep search.
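The trigger could be sketched as follows. This is a minimal illustration, not the specified formula: the spec names the three signals but not how they combine, so the weights, the `expected_count` normalizer, and the function names are all assumptions. Scores are assumed to be normalized to [0, 1].

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfidenceWeights:
    # Hypothetical weights; the spec does not define how the signals combine.
    top_score: float = 0.5
    gap: float = 0.3
    count: float = 0.2


def retrieval_confidence(
    scores: Sequence[float],
    expected_count: int = 5,  # assumed "healthy" result count, not from the spec
    weights: ConfidenceWeights = ConfidenceWeights(),
) -> float:
    """Combine top-score magnitude, rank-1-to-rank-2 gap, and result count."""
    if not scores:
        return 0.0
    # Clamp to [0, 1] and sort best-first.
    ranked = sorted((min(max(s, 0.0), 1.0) for s in scores), reverse=True)
    top = ranked[0]
    # With a single result, the gap is measured against an implicit 0.
    gap = top - ranked[1] if len(ranked) > 1 else top
    count = min(len(ranked) / expected_count, 1.0)
    return weights.top_score * top + weights.gap * gap + weights.count * count


def should_escalate(scores: Sequence[float], threshold: float = 0.3) -> bool:
    """True when confidence falls below memory.deepSearch.threshold."""
    return retrieval_confidence(scores) < threshold
```

Under these assumed weights, a strong and well-separated top hit stays on the primary path, while a flat list of weak scores escalates.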
B) Candidate expansion
Deep search expands beyond primary results via three strategies: temporal neighbors
(same/adjacent sessions), entity neighbors (one-hop via entity_dependencies),
and community members (same Louvain cluster). Pool capped at configurable
limit (default: 50).
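One way to assemble the pool is sketched below. The `Memory` type and the expander callables are hypothetical stand-ins for the temporal, entity, and community lookups; only the cap of 50 and the agent_id scoping come from this spec.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Memory:  # hypothetical shape, for illustration only
    id: str
    agent_id: str
    content: str


# An expander takes the primary results and the agent_id and yields neighbors.
Expander = Callable[[list[Memory], str], Iterable[Memory]]


def expand_candidates(
    primary: list[Memory],
    agent_id: str,
    expanders: list[Expander],
    cap: int = 50,
) -> list[Memory]:
    """Build a deduplicated, agent-scoped pool of at most `cap` memories."""

    def candidates() -> Iterator[Memory]:
        # Primary results first, then each strategy in the order given.
        yield from primary
        for expander in expanders:
            yield from expander(primary, agent_id)

    pool: dict[str, Memory] = {}
    for memory in candidates():
        if len(pool) >= cap:
            break  # later expanders are never called once the cap is hit
        if memory.agent_id != agent_id:
            continue  # defensive check: never admit another agent's memory
        pool.setdefault(memory.id, memory)  # first occurrence wins
    return list(pool.values())
```

Because the expanders are consumed lazily, their order doubles as a priority order when the cap is reached.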
C) LLM comparison and ranking
The LLM receives the query and candidate pool, returns a ranked list with
relevance scores and reasoning. Focuses on semantic connection, temporal
relevance, and contradiction detection. Uses pipeline extraction provider
by default; override via memory.deepSearch.model.
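The ranked list has to be validated before it is merged. The sketch below assumes the LLM is asked to reply with a JSON array of objects with `id`, `relevance`, and `reasoning` keys. That response format is an assumption for illustration, since the spec does not fix one.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Ranked:
    memory_id: str
    relevance: float
    reasoning: str


def parse_ranking(raw: str, candidate_ids: set[str]) -> list[Ranked]:
    """Parse the assumed JSON reply, dropping anything outside the pool."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unusable reply: caller keeps primary results
    if not isinstance(items, list):
        return []

    ranked: list[Ranked] = []
    seen: set[str] = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        memory_id = item.get("id")
        # Ignore ids the model invented and repeated ids.
        if not isinstance(memory_id, str) or memory_id not in candidate_ids:
            continue
        if memory_id in seen:
            continue
        try:
            relevance = float(item.get("relevance", 0.0))
        except (TypeError, ValueError):
            continue
        seen.add(memory_id)
        clamped = min(max(relevance, 0.0), 1.0)
        ranked.append(Ranked(memory_id, clamped, str(item.get("reasoning", ""))))
    return sorted(ranked, key=lambda r: r.relevance, reverse=True)
```

Filtering against `candidate_ids` keeps the LLM from introducing memories that were never in the scoped pool.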
D) Result merging
Weighted blend with primary results. Constraint memories from deep search always surface (cross-cutting invariant). Duplicates use the higher score.
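The merge rules above can be sketched as follows. The `Scored` type, the `deep_weight` of 0.5, and the `limit` parameter are assumptions; the spec gives no blend weights. Here each source's scores are weighted, duplicates keep the higher weighted score, and constraint memories found by deep search are appended even when they fall outside the limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:  # hypothetical result shape
    memory_id: str
    score: float
    is_constraint: bool = False


def merge_results(
    primary: list[Scored],
    deep: list[Scored],
    deep_weight: float = 0.5,  # assumed blend weight
    limit: int = 10,
) -> list[Scored]:
    best: dict[str, Scored] = {}
    forced: set[str] = set()  # constraint memories surfaced by deep search

    sources = ((1.0 - deep_weight, primary, False), (deep_weight, deep, True))
    for weight, results, from_deep in sources:
        for r in results:
            prev = best.get(r.memory_id)
            score = weight * r.score
            if prev is not None:
                score = max(score, prev.score)  # duplicates: higher score wins
            flag = r.is_constraint or (prev is not None and prev.is_constraint)
            best[r.memory_id] = Scored(r.memory_id, score, flag)
            if from_deep and r.is_constraint:
                forced.add(r.memory_id)

    ranked = sorted(best.values(), key=lambda r: r.score, reverse=True)
    # Constraints from deep search always surface, even past the limit.
    extras = [r for r in ranked[limit:] if r.memory_id in forced]
    return ranked[:limit] + extras
```

Appending forced constraints after the cut keeps the ranking of ordinary results untouched, at the cost of occasionally returning more than `limit` items.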
E) Cost and latency bounds
Token budget (default: 8192), timeout (default: 10s, returns primary results on abort), and per-session rate limit (default: 5 invocations).
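A minimal sketch of the timeout and rate-limit bounds, using `asyncio.wait_for` for the abort path. The class and function names are hypothetical, token budgeting is left out, and counting a timed-out call against the session limit is an assumption rather than a stated rule.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable


class SessionRateLimiter:
    """Allows at most `max_invocations` deep searches per session."""

    def __init__(self, max_invocations: int = 5) -> None:
        self.max_invocations = max_invocations
        self._counts: defaultdict[str, int] = defaultdict(int)

    def try_acquire(self, session_id: str) -> bool:
        if self._counts[session_id] >= self.max_invocations:
            return False
        self._counts[session_id] += 1
        return True


async def search_with_bounds(
    primary_results: list,
    run_deep_search: Callable[[], Awaitable[list]],
    limiter: SessionRateLimiter,
    session_id: str,
    timeout_s: float = 10.0,
) -> list:
    # Over the per-session limit: skip deep search entirely.
    if not limiter.try_acquire(session_id):
        return primary_results
    try:
        return await asyncio.wait_for(run_deep_search(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Abort cleanly and fall back to what primary retrieval found.
        return primary_results
```

`asyncio.wait_for` cancels the pending deep search task on timeout, so no LLM call keeps running after the caller has already received primary results.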
F) Scorer feedback loop
Deep search hits produce training pairs: negative for primary path (missed memory), positive for expansion features used. Over time the scorer learns to rank these memories higher, reducing deep search invocation rate.
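The pair generation might look like the sketch below. The `TrainingPair` shape and the strategy labels are hypothetical; the spec only states that a hit yields a negative for the primary path and a positive for the expansion features used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPair:  # hypothetical record shape
    memory_id: str
    feature: str  # "primary" or the expansion strategy that found the memory
    label: int    # 1 = positive, 0 = negative


def feedback_pairs(
    primary_ids: set[str],
    deep_hits: dict[str, str],  # memory_id -> strategy that surfaced it
) -> list[TrainingPair]:
    """Emit pairs only for memories that primary retrieval missed."""
    pairs: list[TrainingPair] = []
    for memory_id, strategy in deep_hits.items():
        if memory_id in primary_ids:
            continue  # primary already found it, so there is no miss to learn
        pairs.append(TrainingPair(memory_id, "primary", 0))
        pairs.append(TrainingPair(memory_id, strategy, 1))
    return pairs
```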
4) Non-goals
- No replacement of primary retrieval (deep search is supplementary only).
- No real-time indexing changes (deep search is read-path only).
- No custom embedding model training.
- No cross-agent deep search (agent_id scoping preserved).
5) Integration contracts
Deep Search <-> Desire Paths
- Deep search consumes graph traversal results and entity dependency edges for candidate expansion.
- Constraint-surfacing invariant applies to deep search results.
- Post-fusion dampening (DP-16) applies before the confidence check.
Deep Search <-> SSM Foundation
- SSM temporal scoring can inform candidate expansion ordering.
- Deep search outcomes feed SSM training as temporal relevance signal.
- SSM evaluation harness includes deep search ablation runs.
Deep Search <-> Predictive Scorer
- Deep search hits/misses produce scorer training pairs.
- Scorer feature vector gains a deep_search_eligible boolean dimension.
- As scorer improves, deep search invocation rate should decline.
6) Rollout phases
Phase 1 (shadow mode)
- Deep search runs in shadow mode: executes but does not merge results.
- Logs deep search candidates, LLM rankings, and comparison to primary results.
- Confidence threshold tuned from shadow data.
- No user-visible behavior change.
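Shadow mode can be sketched as a wrapper that runs deep search, logs the comparison, and always returns the primary results unchanged. The logger name and function signature are assumptions for illustration.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("deep_search.shadow")  # hypothetical logger name


def retrieve_shadow(
    primary_ids: Sequence[str],
    run_deep_search: Callable[[], Sequence[str]],
) -> list[str]:
    """Run deep search for logging only; the caller always gets primary results."""
    try:
        deep_ids = list(run_deep_search())
    except Exception:
        # A shadow failure must never affect the user-visible path.
        logger.exception("shadow deep search failed")
        return list(primary_ids)

    primary_set = set(primary_ids)
    deep_only = [m for m in deep_ids if m not in primary_set]
    logger.info(
        "shadow deep search: primary=%d deep=%d deep_only=%s",
        len(primary_ids), len(deep_ids), deep_only,
    )
    return list(primary_ids)
```

The `deep_only` list is the data that threshold tuning would draw on, since it shows which memories deep search would have added.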
Phase 2 (opt-in active)
- memory.deepSearch.enabled: true in agent.yaml activates result merging.
- Token budget, timeout, and rate limit enforced.
- Dashboard shows deep search invocation count and hit rate.
Phase 3 (scorer-driven adaptive threshold)
- Confidence threshold adjusted dynamically by the scorer based on historical deep search hit rate per query category.
- Deep search invocations decrease as scorer improves.
7) Validation and tests
- Deep search only triggers when confidence is below threshold.
- Candidate pool respects configured cap.
- Timeout aborts cleanly, returns primary results.
- Rate limit enforced per session.
- All expansion queries are scoped by agent_id.
- Constraint memories from deep search surface in final results.
- Shadow mode produces logs without changing returned results.
8) Success metrics
- Recovers at least 30% of memories primary retrieval misses on LoCoMo.
- Average latency under 5s (p95 under 10s).
- Invocation rate decreases over time as scorer improves.
- Zero agent_id scoping violations in candidate expansion.
9) Open decisions
- Whether the LLM prompt includes entity graph context or just raw content.
- Whether to batch multiple low-confidence queries in one LLM call.
- Whether deep search is exposed as MCP tool or internal escalation only.