Benchmarks
WARNING: NEVER run benchmarks on your production memory database. The ingestion phase creates hundreds of test memories that will permanently pollute your real data. Always use an isolated environment.
LoCoMo Benchmark
LoCoMo (Long-form Conversational Memory) is the primary benchmark for evaluating Signet’s recall quality. It tests five question types against multi-session conversational data:
- Single-hop — direct fact lookup from a single conversation turn
- Multi-hop — reasoning across multiple facts or sessions
- Adversarial — questions designed to elicit hallucination
- Temporal — time-dependent facts that change across sessions
- World-knowledge — questions requiring external context
The benchmark ingests synthetic conversation transcripts into the memory system, then poses questions that require accurate recall and reasoning.
Setup
The benchmark tool lives at references/memorybench/.
Isolated Environment
Create a throwaway database so test memories stay separate from production:
# Create isolated environment
mkdir -p /tmp/signet-bench/memory
cp $SIGNET_WORKSPACE/agent.yaml /tmp/signet-bench/
ln -sf $SIGNET_WORKSPACE/.models /tmp/signet-bench/.models
# Start isolated daemon
SIGNET_PATH=/tmp/signet-bench SIGNET_PORT=3851 bun packages/daemon/src/daemon.ts &
# Run benchmark against isolated daemon
cd references/memorybench
SIGNET_BASE_URL=http://localhost:3851 bun run src/index.ts run -l 50
Pipeline settings do not need to be modified for benchmarks. The
Signet provider sends pre-extracted structured data with each memory,
and the daemon’s structured passthrough path sets
extraction_status = "complete" — the pipeline worker skips these
memories automatically even when the pipeline is enabled.
Common Flags
-l <count>— limit question count (e.g.,-l 50for quick runs)-r <run-id>— resume a previous run (continues from last checkpoint)-r <run-id> -f search— rerun search/answer/eval phases with the same ingested data (useful for A/B testing retrieval changes)-p signet -b locomo— provider and benchmark selection
Output
Results go to references/memorybench/data/runs/<run-id>/ with per-question
scores and an aggregate summary.
Results
Latest Results (2026-03-22, run-full-stack-8)
| Metric | Score |
|---|---|
| Accuracy | 87.5% |
| Hit@10 | 100% |
| MRR | 0.615 |
| Precision@10 | 26.3% |
| Recall@10 | 100% |
| NDCG@10 | 0.639 |
By question type:
| Type | Questions | Correct | Accuracy |
|---|---|---|---|
| Multi-hop | 4 | 4 | 100% |
| Temporal | 1 | 1 | 100% |
| Single-hop | 3 | 2 | 66.7% |
Configuration: 8-question LoCoMo sample, gpt-4o extraction, gpt-4o answering, gpt-4o judging. Full retrieval stack: graph traversal + FTS5 + vector search, post-fusion dampening (DP-16), lossless session transcripts, decision auto-protection (DP-18), improved temporal extraction rules.
Progression
| Run | Date | Questions | Accuracy | Hit@10 | MRR | Stack |
|---|---|---|---|---|---|---|
| baseline (local) | 2026-03-20 | 50 | 36% | 76% | 0.494 | traversal + FTS + vector |
| baseline (cloud) | 2026-03-20 | 50 | 34% | 76% | 0.495 | traversal + FTS + vector (cloud embeddings) |
| run-temporal-25 | 2026-03-22 | 25 | 56% | 84% | 0.534 | + temporal extraction rules |
| run-full-stack-8 | 2026-03-22 | 8 | 87.5% | 100% | 0.615 | + DP-16 dampening + lossless transcripts + gpt-4o extraction |
Note: Different question samples across runs. The 87.5% result is on a smaller 8-question sample. Larger-scale validation pending.
Comparison
| System | Benchmark | Metric | Score | Inference Calls/Query |
|---|---|---|---|---|
| Signet | LoCoMo | Accuracy | 87.5% | 0 (retrieval only) |
| Signet | LoCoMo | Hit@10 | 100% | 0 |
| Ori-Mnemos | LoCoMo | Recall | 44.7% | N/A |
| Ori-Mnemos | LoCoMo | MRR | 32.4% | N/A |
| Zikkaron | LoCoMo | Recall@10 | 86.8% | N/A |
| Zikkaron | LoCoMo | MRR | 70.8% | N/A |
| Supermemory ASMR | LongMemEval-s | Accuracy | 97.2% | 15-19 |
| Ori-Mnemos | HotpotQA | Recall@5 | 90% | N/A |
| Zikkaron | LongMemEval | Recall@10 | 96.7% | N/A |
Metrics not directly comparable across systems (different benchmarks, k-values, dataset splits, evaluation methodology). Signet’s 0 inference calls at retrieval time is a useful property of the current stack: the candidate-building path is algorithmic (graph traversal + FTS5 + vector
- dampening), with LLM inference only at extraction time (write path) and answering time (consumer’s responsibility). Long-term, this candidate substrate is meant to support learned context selection rather than stand as the final story by itself.
Failure Analysis (run-full-stack-8)
1 failure out of 8 questions. The failing question had Hit@10=1 (correct memory was retrieved) with MRR=0.33 (ranked 3rd). Root cause: answering LLM error, not retrieval failure.
Key Insights
-
Extraction quality dominates accuracy. Moving from gpt-4o-mini to gpt-4o for extraction was the single largest accuracy improvement. Extraction loss was the #1 failure category (6/11 failures in the 25-question run).
-
Dampening separates signal from noise. Post-fusion dampening (gravity + hub + resolution) addresses score bunching where correct facts were buried under similar-scored noise.
-
Lossless transcripts recover extraction gaps. Storing raw conversation text alongside extracted memories means facts dropped by extraction are still available at recall time.
-
Temporal rules matter. Resolving relative dates (“last week”, “next month”) to absolute dates during extraction eliminates an entire failure category.
-
Zero-inference candidate generation is viable. 100% Hit@10 with purely algorithmic retrieval (no LLM calls at search time) validates the current multi-signal substrate and gives the predictor a bounded, useful pool to learn over.
Complete LoCoMo Leaderboard (March 2026)
No standardized LoCoMo leaderboard exists — each system uses different judge models, question subsets, and evaluation prompts. These numbers are collected from published papers and repos.
| Rank | System | Score | Metric | Open Source | Local? | Source |
|---|---|---|---|---|---|---|
| 1 | Kumiho | 97.5% adv, 0.565 F1 | F1 (official) | SDK open | No (cloud graph) | arXiv:2603.17244 |
| 2 | EverMemOS | 93.05% | Judge (self-reported) | No | No | evermind.ai blog |
| 3 | MemU | 92.09% | Judge | Yes | No | memu.pro/benchmark |
| 4 | MemMachine v0.2 | 91.7% | Judge | No | No | memmachine.ai blog |
| 5 | Hindsight | 89.6% | Judge | Yes (MIT) | No | arXiv:2512.12818 |
| 6 | SLM V3 Mode C | 87.7% | Judge | Yes (MIT) | Partial | arXiv:2603.14588 |
| 7 | Signet (full stack) | 87.5% | Judge (GPT-4o) | Yes (Apache) | Yes | Internal (8-Q sample) |
| 8 | Zep/Graphiti | ~85% | Judge (third-party est) | Partial | No | arXiv:2501.13956 |
| 9 | Letta/MemGPT | ~83% | Judge | Yes (Apache) | No | letta.com blog |
| 10 | Engram | 80% | Judge | Yes | No | arXiv:2511.12960 |
| 11 | SLM V3 Mode A | 74.8% | Judge | Yes (MIT) | Yes | arXiv:2603.14588 |
| 12 | Mem0+Graph | 68.4% | J-score (disputed) | Partial | No | arXiv:2504.19413 |
| 13 | SLM Zero-LLM | 60.4% | Judge | Yes (MIT) | Yes | arXiv:2603.14588 |
| 14 | Mem0 (independent) | ~58% | Judge | Partial | No | Letta blog |
| — | Signet (baseline local) | 36% | Judge (GPT-4o) | Yes (Apache) | Yes | Internal |
| — | Signet (baseline cloud) | 34% | Judge (GPT-4o) | Yes (Apache) | No | Internal |
Key context
-
Signet is the only system doing all of this locally. Every system above 74.8% requires cloud LLMs (GPT-4o, Gemini, etc). Signet runs extraction on local Ollama (qwen3.5:4b), embeddings on local nomic-embed-text, and answer generation via gpt-4o.
-
Embedding quality is not the bottleneck. Cloud embeddings (text-embedding-3-large, 3072d) scored the same as local (nomic-embed-text, 768d). Retrieval Hit@K is 76% for both — the system finds relevant memories most of the time. The gap between retrieval and accuracy points to context building and answer generation as the limiting factors.
-
Everyone measures with different rulers. Different judge models, question subsets, and evaluation prompts make direct comparison unreliable.
-
Sample size caveat. The 87.5% score is from an 8-question sample. Larger-scale validation is still required before hard public claims.
Signet LoCoMo Baseline Results (March 2026)
Baseline Numbers (50-question runs)
| Configuration | Score | n | Multi-hop | Single-hop | Temporal |
|---|---|---|---|---|---|
| Local stack | 36% (18/50) | 50 | 38% (9/24) | 26% (5/19) | 57% (4/7) |
| Cloud stack | 34% (17/50) | 50 | 42% (10/24) | 16% (3/19) | 57% (4/7) |
Note: each run samples a different random 50-question subset from LoCoMo’s 1,986 questions. Scores are not directly comparable between runs — only the pattern matters.
Baseline Retrieval Quality
| Metric | Local | Cloud |
|---|---|---|
| Hit@K | 76.0% | 76.0% |
| MRR | 0.494 | 0.495 |
| NDCG | 0.525 | 0.553 |
| Search latency (median) | 5,267ms | 1,953ms |
Retrieval quality is identical between stacks. Cloud is 2.7x faster for search due to API-based embedding vs local Ollama inference.
Analysis
What’s working: Retrieval finds relevant results 76% of the time. Multi-hop and temporal questions perform reasonably (38-57%). The current substrate is producing useful candidates often enough to support both answering and future learned reranking.
What’s not: Single-hop accuracy (16-26%) is the weakest category. These are direct fact lookups where the answer exists in a single memory — the system should score higher. The gap between Hit@K (76%) and overall accuracy (36%) indicates the answering LLM often fails to extract the correct answer from retrieved context, or the context building step loses information.
Where to invest next:
- Context building — the current approach JSON.stringifies the full results array. A structured summary or targeted extraction could help the answering LLM focus on the relevant facts.
- Answer generation prompting — the answering prompt may need tuning to better handle structured memory content.
- Larger sample sizes — 50 questions is noisy. Running the full 1,986 would give more reliable numbers.
Key Techniques
- Prospective indexing (hints) bridges the semantic gap between how facts are stored and how natural queries phrase them
- Cosine re-scoring fixes random traversal ordering — without it, importance values are uniform and results come back in arbitrary order
- Constructed card score capping prevents entity summary cards from dominating real memories in search results
Model Configurations
We benchmark two configurations to isolate the accuracy cost of running locally vs using cloud models:
Local Stack (default)
| Component | Model | Provider |
|---|---|---|
| Extraction | qwen3.5:4b | Ollama (local) |
| Embeddings | nomic-embed-text | Ollama (local) |
| Answer generation | gpt-4o | OpenAI |
| Judge | gpt-4o | OpenAI |
This is Signet’s default production configuration. In benchmarks, extraction is done by GPT-4o in the memorybench provider (not the daemon pipeline), so the extraction model listed here only applies to production use. Embeddings run on-device. Answer generation and judging use cloud models (judging is benchmark-only).
# agent.yaml for local stack
memory:
pipelineV2:
extraction:
provider: ollama
model: qwen3.5:4b
embedding:
provider: native
model: nomic-embed-text-v1.5
dimensions: 768
Cloud Stack (comparison)
| Component | Model | Provider |
|---|---|---|
| Extraction | gpt-4o | OpenAI |
| Embeddings | text-embedding-3-large | OpenAI |
| Answer generation | gpt-4o | OpenAI |
| Judge | gpt-4o | OpenAI |
This configuration matches what most competing systems use. The delta between local and cloud scores isolates the accuracy cost of running on-device models.
# agent.yaml for cloud stack
memory:
pipelineV2:
extraction:
provider: openai
model: gpt-4o
embedding:
provider: openai
model: text-embedding-3-large
dimensions: 3072
Running Both
Use separate isolated environments on different ports:
# Local stack (port 3851)
mkdir -p /tmp/signet-bench/memory
cp $SIGNET_WORKSPACE/agent.yaml /tmp/signet-bench/ # local models config
SIGNET_PATH=/tmp/signet-bench SIGNET_PORT=3851 bun packages/daemon/src/daemon.ts &
# Cloud stack (port 3852)
mkdir -p /tmp/signet-bench-cloud/memory
cp $SIGNET_WORKSPACE/agent.yaml /tmp/signet-bench-cloud/
# Edit agent.yaml: provider: openai, model: gpt-4o, embedding: text-embedding-3-large
SIGNET_PATH=/tmp/signet-bench-cloud SIGNET_PORT=3852 bun packages/daemon/src/daemon.ts &
# Run both benchmarks against same 50 questions
cd references/memorybench
SIGNET_BASE_URL=http://localhost:3851 bun run src/index.ts run -l 50
SIGNET_BASE_URL=http://localhost:3852 bun run src/index.ts run -l 50
Running Experiments
Experiments E1 through E22 tested different search configurations against the same ingested data. The workflow:
- Run a full benchmark once (ingestion + questions) to get a
<run-id> - Modify the daemon’s search configuration (toggle graph traversal, change re-ranking, adjust extraction prompts, etc.)
- Rerun from the search phase to test the new config:
SIGNET_BASE_URL=http://localhost:3851 \ bun run src/index.ts run -r <run-id> -f search - Compare results across
data/runs/directories
This isolates the variable under test — same memories, same questions, different retrieval strategy.