Retrieval Is Not Memory — SignetAI Blog


Retrieval Is Not Memory

By Nicholai & Mr. Claude

Supermemory's benchmark stunt exposed how easily the industry gets fooled by a single accuracy number. The real question was never about retrieval. Here's why the architecture matters more than the benchmark.

Tags: positioning, benchmarks, architecture, retrieval

This week, the AI memory industry got a mirror held up to its face.

Supermemory published a ~99% accuracy result on LongMemEval using a system called ASMR: “Agentic Search and Memory Retrieval,” a name that is intentionally redundant. Two million views. Trending across tech news. Everyone celebrated. Then they revealed the whole thing was a social experiment designed to expose how easily the field gets fooled by benchmark numbers.

The system was real. They did score ~99%. But it took twelve parallel agents, an eight-variant answering ensemble, and 70 seconds of latency per query. It was built to be absurd on purpose. And the industry embraced it without blinking.

Supermemory’s point: quality alone is a broken metric. You can brute-force any benchmark by throwing enough agents at it. The number looks unbeatable. The system behind it would bankrupt you in production.

We’d written most of this post before the reveal dropped. Our original analysis treated the result as genuine research and argued it wasn’t memory, just retrieval with a very large compute budget. Turns out Supermemory agreed the whole time. The technical critique holds. The framing just got more interesting.

What Retrieval Benchmarks Measure

LongMemEval and LoCoMo are the two most rigorous public benchmarks for long-term agent memory. They simulate realistic conditions: 100K+ token conversation histories, contradictory information, temporal updates, multi-session data. They’re genuinely hard.

But they measure one thing: given a question about something discussed in a conversation history, can the system find the relevant information and generate a correct answer?

That’s retrieval plus reasoning. It is not memory.

Memory, the kind that makes agents genuinely useful across weeks and months of interaction, includes things no retrieval benchmark tests. Does the agent know who it is? Does it maintain consistent personality and values across sessions? Does it track decisions and the reasoning behind them? Does it distinguish between a preference stated yesterday and one that was corrected this morning? Does it understand your projects as organized knowledge rather than retrieved facts? Does it get better at anticipating what you need?

None of that shows up in a retrieval benchmark. The benchmarks test whether the system can find a fact. The real question is whether the agent has a mind.

The Cost of Brute Force

ASMR was designed to prove a point about benchmark gaming, and the architecture makes the point clearly.

The system replaces vector search with active agentic reasoning. Instead of computing cosine similarity against embeddings, it deploys LLM agents to read through stored findings and reason about what’s relevant. The 98.60% result routes context through eight parallel answering variants, each running a full LLM. The 97.20% result uses twelve agents plus an aggregator. The ingestion pipeline deploys three parallel reader agents. The search pipeline deploys three more. A single question can require 15+ LLM invocations.

For a benchmark, that’s fine. For a production system running thousands of queries per day, serving agents that need sub-second response times, operating on user-owned hardware with local models, it’s a different universe entirely.

Supermemory knew this. That was the entire point. Under their own proposed MemScore standard, ASMR scores 99% | 99% | 70,000ms | 10k tokens. Nobody ships that. They built the ceiling case to show that reaching it means nothing if you can’t afford to stay there.

The broader landscape tells a similar story. Mastra OM achieves 84% with observer/reflector agents and scales to 94.87% with newer models. Hindsight from Vectorize hits 91.4% with Gemini-3. MemMachine scores above 91% with aggressive token compression. Each of these systems does something interesting. None of them addresses anything beyond the retrieval problem.

What We Found This Week

We spent two days running Signet against LoCoMo, a benchmark focused on multi-session conversation understanding. Our full-stack result: 87.5% accuracy with 100% Hit@10.

That second number is the one worth sitting with. Hit@10 means the correct memory appeared in the top ten retrieved results for every single question. Retrieval worked perfectly. The 12.5% accuracy gap was a single question where the answering model misused correctly retrieved information. An answer quality problem, not a retrieval one.
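For readers unfamiliar with the metric, Hit@k is straightforward to compute. This is a generic sketch, not Signet's benchmark harness:

```python
def hit_at_k(retrieved_ids, relevant_id, k=10):
    """True if the gold memory appears in the top-k retrieved results."""
    return relevant_id in retrieved_ids[:k]

def hit_rate(runs, k=10):
    """Fraction of questions whose gold memory landed in the top k.
    Each run is {"retrieved": [ids in rank order], "gold": id}."""
    hits = sum(hit_at_k(r["retrieved"], r["gold"], k) for r in runs)
    return hits / len(runs)
```

A Hit@10 of 100% means `hit_at_k` returned true for every question in the suite, independent of whether the answering model then used the memory correctly.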

Here’s how we got there: zero LLM calls at retrieval time.

Signet’s retrieval is entirely algorithmic. It walks a knowledge graph, runs FTS5 keyword search, computes vector similarity against pre-embedded memories, then applies three stages of post-fusion dampening to separate signal from noise. No agents reading through findings. No parallel orchestrators. No ensemble voting. Just data structures, indexes, and math.
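To make "just data structures, indexes, and math" concrete, here is a minimal sketch of two of those channels plus a fusion step, using Python's stdlib `sqlite3` with FTS5. The fusion shown is reciprocal-rank fusion, a standard stand-in; the post doesn't specify Signet's actual fusion logic, and the graph-walk channel is elided:

```python
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two pre-computed embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fts_channel(db, query):
    """Keyword channel: FTS5 orders matches by BM25 via its `rank` column."""
    rows = db.execute(
        "SELECT rowid FROM mem_fts WHERE mem_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [r[0] for r in rows]

def vector_channel(query_vec, embeddings):
    """Vector channel: rank memory ids by similarity to the query embedding."""
    return sorted(embeddings,
                  key=lambda i: cosine(query_vec, embeddings[i]),
                  reverse=True)

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal-rank fusion: merge channels without comparing raw scores."""
    scores = {}
    for ranked in ranked_lists:
        for pos, mem_id in enumerate(ranked):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Everything here runs against a local SQLite file with no network calls, which is the property the paragraph above is claiming.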

The graph walk identifies focal entities from session context, traverses their aspects and attributes, follows dependency edges to related entities. FTS5 catches terminology variations that the graph hasn’t connected. Vector search surfaces semantically related content the other channels missed. Then post-fusion dampening runs three corrections: gravity dampening removes semantic ghosts (high cosine similarity but zero vocabulary overlap), hub dampening demotes well-connected entities that appear in everything, and resolution boosting elevates actionable knowledge like decisions and constraints.
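The three dampening corrections can be sketched as multiplicative adjustments over fused candidates. The multipliers and thresholds below are illustrative guesses, not Signet's tuned values:

```python
def dampen(candidates, query_tokens, graph_degree, hub_threshold=20):
    """Apply the three post-fusion corrections described above.
    Each candidate: {"id", "score", "cosine", "tokens", "kind"}."""
    out = []
    for c in candidates:
        score = c["score"]
        # Gravity dampening: high cosine but zero vocabulary overlap
        # with the query marks a semantic ghost.
        if c["cosine"] > 0.8 and not (set(c["tokens"]) & set(query_tokens)):
            score *= 0.3
        # Hub dampening: demote well-connected entities that appear in everything.
        if graph_degree.get(c["id"], 0) > hub_threshold:
            score *= 0.5
        # Resolution boosting: elevate actionable knowledge.
        if c["kind"] in ("decision", "constraint"):
            score *= 1.5
        out.append({**c, "score": score})
    return sorted(out, key=lambda c: c["score"], reverse=True)
```

The design point: all three corrections are pure functions of data already in the store (vocabulary, graph degree, knowledge type), so they cost microseconds, not LLM calls.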

All of this happens in milliseconds. On a local SQLite database. With no API calls.

A caveat: our 87.5% is from an eight-question sample of the full-stack configuration. The fifty-question baseline with the basic stack scored lower. The trajectory is strong: each architectural addition produced measurable gains, and the 100% retrieval accuracy holds across configurations. But we need larger-scale validation before making definitive accuracy claims. We’re being transparent about that because the insight here isn’t the number. It’s the architecture.

Efficiency Is Architecture

There’s a philosophical difference between “throw more LLMs at the retrieval problem” and “build data structures that make retrieval algorithmic.” Supermemory’s stunt made this difference impossible to ignore.

The core insight they demonstrated is real: vector search alone fails on temporal data and contradictions. Semantic similarity can’t distinguish between an old fact and a recent correction. That’s the right problem. ASMR, the satirical solution, replaced vector search with agentic reasoning. High accuracy, zero scalability. Built to fail at production scale, on purpose.

Signet took a different path. Instead of replacing search with more intelligence at query time, we built structure that makes candidate generation bounded and coherent. The knowledge graph contains entities, aspects, attributes, constraints, and explicit dependency edges. When the system needs context about a project, it doesn’t start from “search everything and hope.” It can walk to the project entity, read its aspects, follow its dependencies, then let flatter retrieval layers and learned ranking handle the rest. Deterministic where possible. Bounded by design.
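The bounded walk described above amounts to a hop-limited breadth-first traversal. The node schema here (aspects, attributes, deps) is our illustrative reconstruction of the structure the post describes, not Signet's actual storage layout:

```python
from collections import deque

def walk(graph, focal, max_hops=2):
    """Collect aspects and attributes starting from focal entities,
    following dependency edges up to `max_hops` hops out.
    Bounded by design: the frontier can never exceed the hop limit."""
    seen, found = set(), []
    queue = deque((entity, 0) for entity in focal)
    while queue:
        node, hops = queue.popleft()
        if node in seen or node not in graph:
            continue
        seen.add(node)
        entry = graph[node]
        found.extend(entry.get("aspects", []))
        found.extend(entry.get("attributes", []))
        if hops < max_hops:
            for dep in entry.get("deps", []):
                queue.append((dep, hops + 1))
    return found
```

Because the candidate set is bounded by graph structure rather than by "search everything," the cost of a query is proportional to the focal entity's neighborhood, not to the size of the whole store.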

This is why a local system running on SQLite can hit 100% retrieval accuracy in our sample. The graph contains the relationships. FTS catches the terms. Vectors find the surprises. Dampening removes the noise. No LLM required.

The LLM calls in Signet happen at write time, during extraction, when sessions are processed into structured knowledge asynchronously after they end. The agent never waits on it. The next session loads instantly because the knowledge is already organized, already indexed, already there.

Write-time intelligence, read-time speed. That’s the trade.

The Larger Problem

All of this is still about retrieval. And retrieval is only one dimension of what makes an agent persistent.

Consider what happens after you solve retrieval perfectly. Your agent can find any fact from any past conversation with 100% accuracy. Now ask: does it know who it is? Does it maintain personality across sessions? Does it remember why it chose one architecture over another three weeks ago? Can it work in Claude Code, then continue in Cursor, then pick up in OpenCode, as the same entity, with the same knowledge, the same values, the same understanding of your work?

These are not retrieval problems. They’re persistence problems. Identity problems. Continuity problems. They live in a layer of the stack that most memory systems don’t address because they’ve framed the entire challenge as “find the fact.”

Signet addresses them because we think they’re the actual hard problem.

Identity persistence. Agents in Signet have identity files that define who they are, their personality, values, and behavioral guidelines. These files travel with the agent across platforms. The model is a guest in a persistent environment, not the center of the system. We wrote about this architecture in You Think Signet Is a Memory System.

Decision tracking. Fourteen regex patterns detect decision language at write time: “chose X over Y,” “switched from,” “going with,” “opted for.” Detected decisions auto-promote to constraints, knowledge that always surfaces in recall regardless of relevance scoring. The agent doesn’t forget why it made a choice. It carries that forward, forever.
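A few of those patterns, reconstructed as illustrative regexes. These are our guesses at the shape, not the fourteen patterns Signet actually ships:

```python
import re

# Illustrative decision-language patterns (hypothetical reconstructions).
DECISION_PATTERNS = [
    re.compile(r"\bchose\s+\S+.*\bover\b", re.IGNORECASE),
    re.compile(r"\bswitched\s+from\b", re.IGNORECASE),
    re.compile(r"\bgoing\s+with\b", re.IGNORECASE),
    re.compile(r"\bopted\s+for\b", re.IGNORECASE),
]

def detect_decision(sentence):
    """True if the sentence matches any decision pattern; in the scheme
    described above, a match would auto-promote to a constraint."""
    return any(p.search(sentence) for p in DECISION_PATTERNS)
```

Pattern matching at write time is the same trade again: spend cheap, deterministic computation once during ingestion so recall never has to reason about what counts as a decision.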

Temporal coherence. Relative dates normalize to absolute dates during extraction. “Last week” becomes “March 15, 2026.” Lossless session transcripts preserve raw conversation text alongside extracted knowledge, so facts that extraction missed are still recoverable at recall time.
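The normalization step can be sketched with the standard library. The phrase table is a tiny illustrative subset of what real extraction would need to handle:

```python
import re
from datetime import date, timedelta

def normalize_relative(text, today):
    """Rewrite common relative-date phrases to absolute ISO dates,
    anchored at `today` (the session date). A minimal sketch."""
    rules = {
        r"\byesterday\b": today - timedelta(days=1),
        r"\blast week\b": today - timedelta(weeks=1),
        r"\btomorrow\b": today + timedelta(days=1),
    }
    for pattern, when in rules.items():
        text = re.sub(pattern, when.isoformat(), text, flags=re.IGNORECASE)
    return text
```

Anchoring matters: "last week" is only meaningful relative to when it was said, so the rewrite has to happen at extraction time, while the session date is still known.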

Predictive behavior. This is the direction the rest of the system is feeding. Signet is building a local scorer that learns from each user’s interaction patterns: which memories matter for which contexts, which traversal paths tend to help, and which injected context repeatedly fails to improve the outcome. Over time, the system should not just remember. It should anticipate, and it should learn from regret, not just reuse.

Knowledge distillation. The database gets smaller and smarter over time. 7,000 sparse facts become 1,000 atomic facts, properly organized, constraint-aware, connected through explicit dependency edges. The noise refines away. The structure remains. This is the opposite of conversation-log systems, which grow linearly and get noisier with every session.

None of these capabilities improve a retrieval benchmark score. All of them make agents dramatically more useful in practice.

MemScore: A Step in the Right Direction

Supermemory’s reveal came with a proposal worth taking seriously. MemScore is a reporting standard that requires four numbers, always reported together:

quality1% / quality2% / avg_latency_ms / tokens

Quality on at least two benchmarks, to prevent overfitting to one. Average retrieval latency, wall-clock, excluding answering model inference. Tokens injected into the answering model per query.
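As a sketch, a MemScore report could be rendered from those four numbers like so. The field names are our assumption; the open-source memorybench spec defines the actual format:

```python
from dataclasses import dataclass

@dataclass
class MemScore:
    """The four numbers MemScore requires, always reported together."""
    quality1: float      # accuracy on benchmark 1, percent
    quality2: float      # accuracy on benchmark 2, percent
    avg_latency_ms: int  # wall-clock retrieval latency, excluding answering model
    tokens: int          # tokens injected into the answering model per query

    def __str__(self):
        return (f"{self.quality1:.0f}% | {self.quality2:.0f}% | "
                f"{self.avg_latency_ms:,}ms | {self.tokens // 1000}k")
```

Rendering ASMR's numbers through it makes the latency and token columns impossible to ignore, which is the entire point of the standard.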

This is a meaningful improvement over the status quo, where providers report a single accuracy number and let the audience fill in the gaps with optimism. Under MemScore, ASMR’s absurdity is visible at a glance: 99% | 99% | 70,000ms | 10k tokens.

MemScore is a genuine step toward honest reporting, and the industry needs it. The spec is open source on memorybench. We’d encourage every provider to adopt it, and we’ll report our own numbers in that format when we publish full-scale benchmark results.

What MemScore won’t settle is the question of what you’re building on top of retrieval. A system with great scores across all four numbers still has to answer the product question: what does the agent actually do with the memories it retrieves? That’s not a measurement problem. That’s an architecture decision. And it’s where the real differentiation lives.

Where We Actually Stand

Honesty matters more than optics.

Signet’s retrieval is strong. 100% Hit@10 across configurations. The graph-based architecture scales, runs locally, and requires no per-query LLM calls. We’re confident in the approach.

Many of the capabilities that sit on top of retrieval (identity persistence, knowledge distillation, decision tracking, cross-platform continuity) are real and shipping. Others, especially the learned scoring layer, are the convergence point the current substrate is building toward. The important claim is not that every part is perfectly mature today. It is that the architecture is aimed at a bigger problem than retrieval.

What we don’t have yet is large-scale benchmark validation of the full-stack accuracy number. We need to run the complete 50+ question suite with all architectural improvements enabled and get a clean number. That work is in progress. We’ll publish it when it’s ready, with full methodology, reproducibility instructions, and MemScore format.

We also don’t claim to have the highest retrieval score on any public benchmark. We might not. What we claim is that our architecture achieves strong retrieval without per-query LLM costs, and then goes further into territory that retrieval benchmarks don’t measure.

The Question That Matters

Agent memory is not solved. The fact that a social experiment could fool the entire industry into thinking it was should tell you something about where we actually are.

Retrieval is getting better. The benchmarks are getting more competitive. But the cognitive problem of building agents that persist, evolve, and genuinely understand the people they work with is barely started.

The question is not “can you find the fact?”

The question is what the agent already knows when it walks into the room.


Signet is open source. The knowledge architecture, the desire paths traversal system, and the extraction pipeline are all part of the shipped substrate. Local-first. User-owned. No API keys required for retrieval. If retrieval were memory, we’d be done. We’re not done.