Benchmarking progress log

This is a development progress log, not a publishable benchmark claim. Run artifacts and reports remain ignored under memorybench/data/runs/; this document only records selected local tuning summaries so regressions can be understood later.

2026-04-18: structured evidence recall pass

This is a development progress log, not a publishable benchmark claim. The runs below used small six-question LongMemEval samples while tuning the Signet provider and daemon recall behavior.

The important change was adding structured evidence shaping to daemon recall. Instead of collapsing every signal into one early score, recall now keeps lexical, semantic, prospective hint, and traversal evidence separate before reranking and dampening. Traversal-only candidates are capped below directly anchored evidence, and hint matches can rescue class-to-instance questions such as “music streaming service” matching a memory that says “Spotify.”

Run	Setup	Accuracy	Hit@K	MRR	NDCG	Mean search
`lme-openrouter-six-20260418T194618Z`	OpenRouter ingestion, pre-SEC recall	5/6, 83.3%	100%	0.625	0.734	761 ms
`lme-sec-six-20260419T0339Z`	OpenRouter ingestion, SEC recall, warmed dev workspace	6/6, 100%	100%	0.917	0.930	848 ms
`lme-dev-six-20260419T0818Z`	Mercury-2 ingestion via OpenRouter, structured remember graph, Gemma 4 26B Q5 answer/judge via llama.cpp	6/6, 100%	100%	0.889	0.873	1427 ms
`lme-secpath-six-20260419T090700Z`	Same warmed workspace, structured path evidence added as a SEC recall channel	6/6, 100%	100%	0.833	0.857	1042 ms

The direct comparison is encouraging because the hit rate was already high, but the ranking was weak. SEC improved the order of the evidence, not just whether the answer appeared somewhere in the pile. That matters more than the one extra correct answer: a higher MRR means the right memory is closer to the top, which reduces how much context the answer model has to search through.

The lme-dev-six-20260419T0818Z run is the first small pass after separating graph reads from background extraction graph writes and routing benchmark graph structure through structured remember. Its graph hygiene report was clean: 0 suspicious entities, 0 duplicate canonical groups, and 0 active attributes missing group_key, claim_key, or source memory. It did surface 8 safe known-entity mention candidates, which is normal repair/normalization work, not background graph authorship.

The structured graph run preserved 100% accuracy and 100% Hit@K, and it massively improved ranking over the earlier currentness-only run (MRR 0.889 vs 0.458, NDCG 0.873 vs 0.482). It was slightly behind the SEC-only six-question run on aggregate ranking (MRR 0.889 vs 0.917, NDCG 0.873 vs 0.930) because the single-session-preference question ranked the right evidence third (MRR 0.33) even though the answer was judged correct. That was the retrieval wrinkle targeted by the follow-up patch below.

The follow-up patch added structured path evidence as a SEC recall channel. Recall now scores candidate memories against active structured graph rows using aspect, group key, claim key, attribute kind, and attribute content, while still requiring concrete query overlap before preference/advice boosts apply. This fixed the first regression target: in lme-secpath-six-20260419T090700Z, the single-session-preference question ranked the virtual coffee-break memory first instead of third (MRR 1.00, NDCG 1.00).

The aggregate MRR moved from 0.889 to 0.833 in that rerun because the local Gemma relevance pass marked the first relevant hit later for the multi-session and single-session-user questions, even though the benchmark answer/judge score remained 6/6. The retrieval wrinkle we targeted is fixed, but the next larger run should watch whether this is judge noise or a real ranking tradeoff outside the preference/advice path. The important durable TODO is that every recall surface should eventually use the same structured-path SEC evidence, not just the MemoryBench-facing daemon recall path.

The latency tradeoff is improving but still visible. Mean search increased from 761 ms in the pre-SEC run to 848 ms with SEC, then 1427 ms in the first structured graph run. After the structured-path patch, the same warmed workspace searched in 1042 ms mean with SIGNET_BENCH_EMBEDDING_PROVIDER=ollama set explicitly, so the path is moving the right direction without dropping accuracy.

The same tuning pass also raised ingest concurrency for local development. With OpenRouter Mercury extraction, question-level ingest concurrency 3 plus per-question session concurrency 6 ingested 283 LongMemEval sessions across six questions in about 7 minutes wall time. The extraction/remember side averaged about 65 seconds per question. The remaining long pole was indexing wait on a couple of questions, not extraction.

The first local-only 12-question loop used Gemma 4 E4B through vLLM for structured ingestion and Gemma 4 26B Q5 through llama.cpp for answering and judging. It ingested 575 LongMemEval sessions in about 31 minutes with question-level ingest concurrency 3 and session concurrency 8. One session hit the local vLLM 8192-token context limit by a single token, so future local runs should either pre-truncate extraction input or raise the local context budget before treating larger samples as production numbers.

That first 12-question run, lme-local-vllm12-20260419T091740Z, scored 11/12 with MRR 0.639, NDCG 0.679, Hit@K 83.3%, F1 0.418, mean search 969 ms, and 2328 average answer-context tokens. The miss was not an ingestion failure: the turbinado-sugar memory existed, but the cookie-advice query did not rank it high enough for Gemma to use it correctly.

The follow-up local run added three narrow harness/recall corrections rather than a memory-system rewrite. First, daemon recall now applies a small mechanical keyword expansion for explicit baking/recipe queries, e.g. cookies can bridge to sugar, flavor, texture, ingredients, recipes, and desserts. Second, gravity dampening now applies to SEC/structured sources too, so structurally shaped results without surface support do not get a free pass. Third, the Signet MemoryBench answer prompt now tells the answering model to use remembered preferences to give concrete personalized advice, not merely repeat the known preference back to the user.

With those changes, lme-local-vllm12-answerprompt-20260419T102313Z scored 12/12 with MRR 0.792, NDCG 0.816, Hit@K 91.7%, F1 0.479, mean search 1026 ms, and 2246 average answer-context tokens. The cookie-advice target moved from rank 6 to rank 1, but the answer/judge path was still too fragile: one judge pass accepted an answer that mostly repeated the turbinado-sugar preference instead of building on it.

The next patch fixed two harness issues exposed by that run. First, LongMemEval already marks the answer-bearing session with answer_session_ids/has_answer, so MemoryBench now uses those labels for retrieval metrics when they are available instead of asking the local judge to guess relevance from retrieved text. This is not extra context for answering; it only makes retrieval scoring less noisy. That corrected the colleague / virtual-coffee case: the relevant memory was already at rank 2, but the local relevance judge had marked it as irrelevant. Second, the Signet answer prompt now says that advice questions should treat remembered preferences as the starting point and recommend a next step, pairing, variation, or technique that builds on them.

The resulting run, lme-local-vllm12-answer-refine-20260419T104059Z, scored 12/12 with Hit@K 100%, MRR 0.833, NDCG 0.889, F1 0.548, mean search 1026 ms, and 2246 average answer-context tokens on the same warmed local workspace. The local-only comparison now looks like this:

Run	Setup	Accuracy	Hit@K	F1	MRR	NDCG	Mean search	Avg context
`lme-local-vllm12-20260419T091740Z`	Local vLLM E4B ingest, 26B Q5 answer/judge, pre-baking/advice fix	11/12, 91.7%	83.3%	.418	.639	.679	969 ms	2328 tok
`lme-local-vllm12-answerprompt-20260419T102313Z`	Baking/advice recall shaping plus initial advice prompt	12/12, 100%	91.7%	.479	.792	.816	1026 ms	2246 tok
`lme-local-vllm12-answer-refine-20260419T104059Z`	Exact LongMemEval retrieval labels plus refined advice prompt	12/12, 100%	100.0%	.548	.833	.889	1026 ms	2246 tok

There is still a useful caveat here: this is a warmed local development workspace, not a publishable score. It is exactly the right fixture for tuning recall and answer behavior without re-ingesting 575 sessions every time, but a public number should be regenerated from a fresh isolated workspace with the final production model choices.

The fixed 12-question canary created for autoresearch is a separate fixture, so do not compare it directly against the random/warmed runs above. Its purpose is to keep the same known failure cases in front of us while changing one piece at a time. On that fixed set, the first local-only run exposed four concrete failures: compressed recommendation details, missing count-update arithmetic, weak temporal/currentness use, and low-confidence transcript fallbacks outranking stronger structured evidence. The follow-up kept transcript fallback as a bounded recall feature but capped transcript-only hits below real memory evidence and prepended short transcript excerpts to retrieved memory rows when expand: true is requested.

That pass also fixed a harness bug: LongMemEval question_date was loaded from the dataset but not written into the run checkpoint, so temporal answer prompts were saying Question Date: Not specified on resumed runs. That was the real reason the Ibotta question could retrieve the right 16 April 2023 memory and still fail to answer “3 weeks ago.” The checkpoint now stores question dates on new runs and backfills them when resuming older checkpoints.

Fixed canary run	Setup	Accuracy	Hit@K	F1	MRR	NDCG	Mean search	Avg context
`lme-canary12-vibes-20260419T163939Z`	Fixed 12Q local ingest, pre-transcript/date fixes	8/12, 66.7%	91.7%	.607	.819	.842	1790 ms	1841 tok
`lme-canary12-vibes-20260419T163939Z-transcript-lite`	Bounded transcript fallback, before score cap	10/12, 83.3%	100.0%	.434	.579	.670	1894 ms	2450 tok
`lme-canary12-vibes-20260419T163939Z-transcript-capped`	Transcript fallback capped below real memory evidence	11/12, 91.7%	100.0%	.434	.903	.947	1924 ms	2792 tok
`lme-canary12-vibes-20260419T163939Z-transcript-datefix`	Transcript cap plus checkpoint `question_date` preservation	12/12, 100%	100.0%	.434	.903	.947	1866 ms	2792 tok
`lme-canary12-fresh-9503efd5-20260419T175605Z`	Fresh local E4B ingest, 26B Q5 answer/judge, SEC path rank + metric fixes	12/12, 100%	100.0%	.420	1.000	.982	1329 ms	2815 tok
`lme-canary12-summary-hydration-20260420T043330Z`	Same warmed canary after transcript fallback hydrates same-session summary	12/12, 100%	100.0%	.371	.847	.866	1162 ms	4298 tok

The F1 drop in the fixed canary is expected from adding supplemental recall evidence. It means more non-answer evidence is visible, not that the answer path got worse. The important ranking metrics recovered after transcript hits were capped: MRR returned from 0.579 to 0.903, and NDCG from 0.670 to 0.947, while accuracy reached 12/12 once the missing question-date metadata was repaired.

The fresh local canary was regenerated from an isolated workspace using Gemma 4 E4B via vLLM for ingestion and Gemma 4 26B Q5 via llama.cpp for answering and judging. It kept 12/12 accuracy, moved every answer-bearing session to rank 1, and lowered mean search latency by about half a second against the date-fix canary. The last ranking wrinkle was the colleague / virtual coffee-break question: moderate structured path evidence now has enough weight to beat generic “thinking/suggestions” lexical noise from unrelated recommendation sessions.

That run also exposed and fixed a retrieval-metric bug. When LongMemEval provides answer_session_ids, duplicate chunks from the same answer-bearing session must not count as multiple ideal relevant documents. Without that cap, NDCG could exceed 1.0, which is mathematically invalid. Retrieval scoring now counts each labeled relevant session once, so duplicate evidence is useful for answering but does not inflate NDCG.

The summary-hydration run reused the warmed canary workspace after a post-rebase safety run regressed to 11/12. The miss was not a ranking miss: both relevant sessions were retrieved, but the transcript fallback excerpt for the Wednesday yoga session started too late and omitted the exact schedule fact. The daemon now hydrates transcript-only fallback hits with the same-session structured memory summary when one exists. That let the answerer combine Tuesday/Thursday Zumba, Wednesday yoga, and Saturday weightlifting into the correct four-day answer without changing the underlying ingested data.

Merge story and surface parity plan

The merge story for this PR is intentionally narrow: land the MemoryBench integration, isolated benchmark daemon workflow, structured remember path, currentness/supersession fixes, SEC recall shaping, transcript fallback, and benchmark documentation as one coherent benchmark foundation. Do not turn this branch into the PR where every harness learns every new memory behavior. The daemon is the source of truth for recall semantics; harnesses should be plumbing and presentation, not separate recall engines.

The follow-up should be a smaller recall-surface parity PR. Its contract should be simple: /api/memory/recall is the canonical recall engine, /api/hooks/user-prompt-submit uses the same engine with a tighter auto-inject budget, /api/memory/remember is the canonical structured remember entrypoint, and knowledge graph navigation is exposed through explicit daemon/CLI/MCP tools instead of hidden benchmark-only behavior. The parity PR should include a small contract test proving CLI, MCP, SDK, hook recall, and harness recall all consume the daemon recall path rather than reimplementing structured evidence, SEC ranking, currentness annotations, or transcript fallback locally.

For the CLI and MCP surfaces, the graph should be navigable in the same mental model agents and humans use when browsing a filesystem or rooms in a house: entity, aspect, group, claim, attribute. knowledge_expand can stay as the “whole card” view, but agents also need narrow list/get tools so they can scan large graphs without requesting a giant blob. The target shape is:

entity.list()
entity.get("Nicholai")
entity.aspects("Nicholai")
entity.groups("Nicholai", "food")
entity.claims("Nicholai", "food", "restaurants")
entity.attributes("Nicholai", "food", "restaurants", "favorite_restaurant")

For harnesses, the audit is mostly about preserving daemon output faithfully. OpenClaw, OpenCode, Pi/Oh-My-Pi, Hermes, the browser extension, and SDK clients should preserve enough metadata to debug recall: result source, source session, supplementary status, structured/SEC/transcript origin, currentness annotations, and expanded sources when requested. They should not decide their own ranking or graph traversal rules. Explicit recall and benchmarks can opt into expand: true; automatic prompt injection should stay tighter and use transcript evidence as bounded rescue when structured recall is empty or anchor terms are missing. That keeps prompts useful without turning every turn into transcript soup.

After this PR merges, run one fresh clean benchmark from an isolated workspace with the final production model choices before publishing any score. Warmed workspace runs are for development ratcheting; fresh isolated runs are the only credible public numbers.

The first unattended autoresearch loop after that canary reused a fresh local 12-question workspace, then copied the checkpoint from search forward to test recall-side changes without re-ingesting. The first report looked better than it really was: two temporal questions produced empty local-model answers and were omitted from the denominator, leaving a misleading 9/10. The harness now records an explicit I don't know. when the answer model returns empty output, so every selected question remains in the score denominator.

The same run exposed two recall-side issues. First, transcript excerpts were choosing the first weak lexical hit, such as an assistant later saying “ask Mark and Sarah”, instead of the densest window containing the user’s actual temporal fact, “I met Mark and Sarah on a beach trip about a month ago.” Second, relative temporal search questions such as “four weeks ago” were sent to recall without the provided LongMemEval question date, so recall could not mechanically search for the resolved date. The fix was intentionally small: transcript fallback now scores candidate windows by query-term density, quantity/temporal hints, and simple verb variants like meet/met; the Signet provider now appends absolute date hints for relative search phrases using the benchmark-provided question date.

Random local 12Q run	Setup	Accuracy	Hit@K	F1	MRR	NDCG	Mean search	Avg context
`lme-local-explore12-20260419T190341Z`	Fresh local E4B ingest, before denominator/transcript/date-hint fixes	9/10, 90.0%	100.0%	.438	.900	.911	1236 ms	2751 tok
`lme-local-explore12-20260419T190341Z-dense-transcript-20260419...`	Same data, dense transcript selection plus explicit empty-answer abstention	10/12, 83.3%	100.0%	.416	.854	.864	504 ms	2780 tok
`lme-local-explore12-20260419T190341Z-temporal-search-20260419...`	Same data, dense transcript selection plus relative-date search hints	12/12, 100%	100.0%	.423	.875	.890	579 ms	3046 tok

The later lme-local-explore12-20260419T202440Z workspace is another random 12-question fixture, not a direct continuation of the 190341Z fixture. It looked like a regression because the previous random slice had recovered to 12/12, but the right comparison is within the same ingested data source. That fixture exposed a different skeleton: the structured graph contained the answer facts, but daemon recall was not surfacing those structured rows as first-class candidates.

In the baseline run, both single-session-user questions missed retrieval entirely. Direct database inspection showed the structured attributes were there, for example a music_preferences / listening_habits / recent_platform claim saying the user had been listening to songs on Spotify lately, and a personal_preferences / shampoo_preferences / preferred_shampoo_scent_and_source claim saying the user liked lavender shampoo from Trader Joe’s. The failure was not extraction. It was candidate shaping: traversal-primary recall could spend the flat candidate budget before structured candidates ever reached SEC ranking. That meant the thing we saved into the graph was visible to the database, but not reliably visible to recall.

The first structured-candidate patch proved the diagnosis but did not fix the path. It searched active structured attributes generically, but those candidates were still reduced to a small additive boost behind unrelated vector neighbors. It also briefly included a too-specific spotify query expansion while debugging. That expansion did not improve the run, and it has been removed because provider-side bridges may connect generic vocabulary, e.g. music to song or playlist, but must not include answer-specific product names. The fairness rule is simple: if the LongMemEval data were swapped out, the same generic structure and query bridges should still make sense.

The follow-up patch makes structured rows a real recall surface. Daemon recall now searches active entity_attributes through entity, aspect, group key, claim key, attribute kind, and content; merges those memory ids into the SEC candidate pool before flat candidate trimming; and lets strong structured path evidence stand on its own instead of being only a tiny bonus on top of embedding score. That recovered both single-session-user questions on the same warmed fixture without hardcoded answer terms. The remaining miss had the answer-bearing session in the retrieved set, so the next issue is answer/context handling, not missing graph recall.

Random local 12Q run, second fixture	Setup	Accuracy	Hit@K	F1	MRR	NDCG	Mean search	Avg context
`lme-local-explore12-20260419T202440Z`	Fresh local E4B ingest, before structured candidates were surfaced	9/12, 75.0%	83.3%	.330	.708	.727	451 ms	3219 tok
`lme-local-explore12-20260419T202440Z-structured-candidates-...`	Same data, initial attribute search, candidates still lost before SEC	10/12, 83.3%	83.3%	.330	.708	.727	451 ms	3182 tok
`lme-local-explore12-20260419T202440Z-structured-fused-...`	Same data, structured rows merged into SEC pool, 16k local answer ctx	11/12, 91.7%	100.0%	.329	.833	.843	449 ms	4449 tok

This is the regression story in plain terms: the graph work was helping only when the right memory also survived ordinary candidate selection. Once a random slice asked for facts whose strongest evidence lived in aspect/group/claim shape rather than obvious lexical text, recall dropped them. The fix is not to make extraction more clever or add benchmark-specific names. The fix is to make recall use the structure it was already being given. That raised Hit@K from 83.3% to 100% on the same fixture while keeping mean search basically flat. The cost is context size: average answer context rose to 4449 tokens, so the next tuning pass should reduce duplicate/noisy context now that the right structured evidence is actually entering the pool. The remaining failed question was an advice-style recommendation prompt where the relevant healthcare-AI publication preference was retrieved at rank 2 (MRR 0.50 for that question), but the answer model also used unrelated retrieved context about nanotechnology and robotics. That is now an answer/context shaping problem, not a missing structured recall problem.

The 9/10 row should not be read as a better score than the 10/12 row. It is the opposite: the denominator bug hid two temporal failures. Once every question was counted, the remaining misses were exactly the kind of temporal recall failures we wanted the loop to surface. The final row fixed both temporal questions without changing ingestion or adding hidden answer context. The retrieval metrics remain lower than the fixed canary because this random set has harder temporal/noisy-neighbor cases, but answer accuracy recovered to 12/12 and the run now reports all selected questions cleanly.