Benchmarking

Signet memory benchmarks run through memorybench/. The benchmark harness owns the datasets, checkpointing, answer generation, judging, retrieval metrics, and reports. Signet is only a MemoryBench provider.

memorybench/ is a root workspace because it is a development harness, not a Signet runtime package. Runtime code still belongs under platform/, human surfaces under surfaces/, integrations under integrations/, and reusable libraries under libs/.

Current LongMemEval score

Signet’s latest tracked MemoryBench LongMemEval runs average 97.6% answer accuracy under the rules profile. This is an average across the current tracked local and canary score set, not a claim that every individual run lands at exactly that value.

For the underlying run ledger and per-run accuracy, Hit@K, F1, MRR, NDCG, latency, and context-size notes, see docs/BENCHMARKING-PROGRESS.md.

Default developer run

bun run bench

This command:

Builds the workspace with bun run build.
Creates a temporary isolated Signet workspace under /tmp.
Starts a Signet daemon bound to 127.0.0.1 on a free port.
Runs MemoryBench against LongMemEval using the signet provider.
Shuts the daemon down and removes the temporary workspace.

The default run uses a small LongMemEval sample, one question per question type, so developers can run it while iterating. The command prints the exact MemoryBench command and run id before executing.

MemoryBench reports scores by question type in report.json, so the default run gives a clean per-type breakdown without extra commands. Benchmark reports and run artifacts stay under ignored paths and should not be committed until the team explicitly decides to publish a score.

Larger runs

Run the full LongMemEval benchmark:

bun run bench -- --full

Run a fixed-size sample:

bun run bench -- --limit 20
bun run bench -- --sample 3

--limit and --sample select questions, not individual sessions. A small LongMemEval question set can still ingest many sessions because each question has its own haystack.

Run one LongMemEval question type:

bun run bench -- --type temporal-reasoning --limit 20
bun run bench -- --type knowledge-update --sample 5

Valid LongMemEval types include:

single-session-user
single-session-assistant
single-session-preference
multi-session
temporal-reasoning
knowledge-update

Run an explicit question set:

bun run bench -- --question-id 32260d93 --question-id 54026fce
bun run bench:ingest -- --question-ids-file memorybench/config/autoresearch/longmemeval-canary-12.txt

--question-id is repeatable and also accepts comma-separated ids. --question-ids-file reads one id per line and ignores blank lines and # comments. Use this for fixed canaries so reruns test the same questions instead of whatever a fresh random sample happens to draw.

Skip the workspace build when you already built locally:

bun run bench -- --no-build --limit 10

Keep the isolated workspace for inspection:

bun run bench -- --keep-workspace --limit 5

Preview the command without building, starting the daemon, or running the benchmark:

bun run bench -- --dry-run

Two-stage local model workflow

For local tuning, keep extraction cheap and reserve the stronger model for the parts that affect reasoning quality:

Run ingest/indexing with a small fast model.
Continue the same checkpoint from search with the larger answer/judge model.

This avoids spending hours re-ingesting LongMemEval sessions through a 26B model while still testing answer quality with the model we care about.

Example split run:

export RUN_ID="lme-rules-split-$(date -u +%Y%m%dT%H%M%SZ)"
export WORKSPACE=".bench/workspaces/longmemeval-structured"

Start a fast OpenAI-compatible extraction server, for example Gemma E4B via vLLM, then ingest. On a single-GPU workstation this can use the same port as the later answer server because the phases run sequentially:

OPENAI_API_KEY=dummy \
OPENAI_BASE_URL=http://127.0.0.1:8000/v1 \
MEMORYBENCH_EXTRACTION_MODEL=google/gemma-4-E4B-it \
MEMORYBENCH_EXTRACTION_MAX_TOKENS=1200 \
MEMORYBENCH_STRUCTURED_EXTRACTION_MAX_TOKENS=1800 \
SIGNET_BENCH_EMBEDDING_PROVIDER=ollama \
SIGNET_BENCH_EMBEDDING_MODEL=nomic-embed-text \
SIGNET_BENCH_RUN_ID="$RUN_ID" \
bun run bench:ingest -- --no-build --workspace "$WORKSPACE" --limit 6 --concurrency-ingest 1

For impatient local iteration, ingestion can use OpenRouter while answer/judge still use a local model later. Store the key in Signet as OPENROUTER_API_KEY, inject it into the command environment as OPENROUTER_API_KEY, and pass --ingest-openrouter:

signet secret put OPENROUTER_API_KEY

SIGNET_BENCH_RUN_ID="$RUN_ID" \
MEMORYBENCH_SESSION_CONCURRENCY=4 \
bun run bench:ingest -- --ingest-openrouter --no-build --workspace "$WORKSPACE" --limit 6 --concurrency-ingest 2

--ingest-openrouter only affects bench:ingest. It maps the injected OPENROUTER_API_KEY to OPENAI_API_KEY for the MemoryBench extraction client, sets OPENAI_BASE_URL=https://openrouter.ai/api/v1, and defaults MEMORYBENCH_EXTRACTION_MODEL=inception/mercury-2. It does not change local answering or judging in a later bench:evaluate step. For local dev speed, use modest question concurrency plus MEMORYBENCH_SESSION_CONCURRENCY; this only changes how many sessions are ingested at once, not what data is written or how answers are scored.

Then stop the extraction server, start the larger OpenAI-compatible answer/judge server, for example Gemma 26B Q5 via llama.cpp, and continue the same run:

OPENAI_API_KEY=dummy \
OPENAI_BASE_URL=http://127.0.0.1:8000/v1 \
SIGNET_BENCH_ANSWERING_MODEL=google_gemma-4-26B-A4B-it-Q5_K_M.gguf \
SIGNET_BENCH_JUDGE=google_gemma-4-26B-A4B-it-Q5_K_M.gguf \
SIGNET_BENCH_EMBEDDING_PROVIDER=ollama \
SIGNET_BENCH_EMBEDDING_MODEL=nomic-embed-text \
SIGNET_BENCH_RUN_ID="$RUN_ID" \
bun run bench:evaluate -- --no-build --workspace "$WORKSPACE"

The --resume flag is a wrapper guardrail. It prevents bun run bench from adding --force when continuing a checkpoint. Use it whenever the run already has ingested data that should be preserved.

Benchmark profiles

The wrapper supports two explicit Signet profiles:

bun run bench -- --profile rules
bun run bench -- --profile supermemory-parity

rules is the default. It uses the signet provider and follows the common MemoryBench phase contract:

ingest extracted structured memories through /api/memory/remember
search with the orchestrator’s requested limit, currently 10
answer from bounded recall results only
use /api/memory/recall with expand: true, so any lossless source snippets come from the recall API surface itself, not a benchmark-side hidden context channel
pass LongMemEval question_date into the Signet provider so relative temporal search phrases such as “four weeks ago” can be resolved into mechanical absolute-date search hints before recall
do not dump full raw transcripts into the answer prompt

supermemory-parity uses the signet-supermemory-parity provider. It is not a publishable fair-score profile. It intentionally mirrors the upstream Supermemory adapter shape, which does not match the common provider shape required for fair testing. Use it only to diagnose whether a low Signet score is caused by Signet itself or by comparing against Supermemory’s non-conforming, more permissive adapter:

ingest each session as the same date header plus stringified raw JSON conversation that the Supermemory adapter stores
search with a limit of 30, matching the Supermemory adapter’s current hardcoded limit
answer from raw session-shaped memory content instead of extracted memory summaries

Keep results from these profiles separate. A supermemory-parity result answers “how does Signet perform when given Supermemory’s adapter advantage?” A rules result answers “how does Signet perform under the harness contract we intend to publish?”

Supermemory adapter contract violation

For a fair MemoryBench run, a provider should treat the orchestrator’s SearchOptions as the required test contract. The common search phase calls every provider with:

limit: 10
threshold: 0.3

The imported Supermemory provider currently violates that provider contract. It does not honor the required limit: 10 shape passed by the harness. Instead, it hardcodes limit: 30, asks Supermemory to return both summaries and chunks, and uses a provider-specific answer prompt that tells the model to prioritize those chunks as raw source material.

This is not just a harmless implementation detail. It means the upstream Supermemory provider is not being tested under the same required adapter shape as strict providers. It can provide more retrieved items and richer raw session context to the answer model than a conforming provider receives from the common harness contract. This can especially affect incidental-fact questions, where a raw chunk may still contain a fact that an extraction-based provider dropped.

For fair reporting, use the rules profile and document the exact provider contract used. Treat supermemory-parity as a diagnostic profile only. It exists to measure the advantage created by Supermemory’s current non-conforming adapter shape, not to produce a publishable score.

Isolation rules

Benchmarks must never read from or write to ~/.agents/memory/memories.db. The wrapper sets SIGNET_PATH and HOME to temporary benchmark directories before starting the daemon. This prevents production memory, Claude project memory, and user identity files from being mounted into benchmark runs.

The MemoryBench Signet provider scopes every write and search with:

agentId: memorybench
project: memorybench
scope: <question-id>-<data-source-run-id>
sourceType: memorybench-session

That scope is per question, matching MemoryBench’s provider isolation model.

Persistent tuning workspaces

Clean benchmark runs use a fresh temporary Signet database. For development tuning, you can preserve and reuse a benchmark workspace explicitly:

SIGNET_BENCH_EMBEDDING_PROVIDER=ollama bun run bench -- --workspace .bench/workspaces/longmemeval-structured --sample 1

A persistent workspace keeps the Signet database under .bench/, which is ignored by Git. Use this only for tuning and clearly label any results as warmed or reused. Public/comparable benchmark numbers should still use a fresh isolated workspace.

Cached ingestion for local iteration

It is acceptable to do one expensive bulk ingestion into a temporary benchmark workspace, then reuse that workspace while tuning retrieval, ranking, context packing, answering, or judging. Treat this as a development fixture, not a publishable benchmark score.

The safe pattern is:

export RUN_ID="lme-dev-cache-$(date -u +%Y%m%dT%H%M%SZ)"
export WORKSPACE=".bench/workspaces/lme-dev-cache"

# One-time extraction + remember + indexing pass.
SIGNET_BENCH_RUN_ID="$RUN_ID" \
MEMORYBENCH_SESSION_CONCURRENCY=4 \
bun run bench:ingest -- --no-build --workspace "$WORKSPACE" --limit 60 --concurrency-ingest 2

# Repeatable search/answer/judge passes against the same isolated DB.
SIGNET_BENCH_RUN_ID="$RUN_ID" \
bun run bench:evaluate -- --no-build --workspace "$WORKSPACE"

Invalidate the cached workspace and ingest again whenever the thing being tuned changes what gets written to Signet:

extraction prompt or extraction model
structured entity/aspect/hint schema
/api/memory/remember request shape
daemon graph persistence
embedding provider, model, dimensions, or indexing behavior
dataset selection or question sampling

Do not invalidate it for changes that only affect recall behavior after ingest, such as search thresholds, traversal, reranking, context packing, answer model, or judge model. That is the whole point of the cache: isolate ingest cost from the parts we are actively tuning.

When reporting results from a cached workspace, say so plainly. The phrase we use internally is “warmed dev workspace.” A production/comparable score must use a fresh isolated workspace and the intended production model configuration.

Autoresearch ratchet workflow

The overnight random loop was useful for collecting failures, but it was not a self-improving research loop. It repeatedly sampled new twelve-question runs against the same code, so a good row could just mean the sample was easier. That kind of loop is exploration only. It can feed a failure queue, but it is not a scoreboard.

The local autoresearch workflow uses a fixed LongMemEval canary instead:

memorybench/config/autoresearch/longmemeval-canary-12.txt

The canary mixes known failure targets with stable controls across single-session preference, temporal reasoning, knowledge update, multi-session, single-session assistant, and single-session user questions. Keep that file stable unless we intentionally reset the scoreboard.

The ratchet rule is boring on purpose:

Pick one hypothesis.
Make the smallest code or prompt change that tests it.
Run the fixed canary against the same model split.
Keep the patch only if the canary improves without known regressions.
If a random exploration run finds a new skeleton, add it to the failure queue or propose a canary update separately.

Use the helper script for the mechanics:

bun scripts/autoresearch-memorybench.ts status
bun scripts/autoresearch-memorybench.ts ids
bun scripts/autoresearch-memorybench.ts triage --run-id <run-id> --write-queue
bun scripts/autoresearch-memorybench.ts compare --base <old-run-id> --candidate <new-run-id>
bun scripts/autoresearch-memorybench.ts run-canary

run-canary prints the exact two-stage local commands by default. Add --execute only when the local OpenAI-compatible model server is already running. The default local split is Gemma 4 E4B for ingestion through vLLM, Gemma 4 26B Q5 for answer/judge through llama.cpp, and Ollama nomic-embed-text for embeddings. Answer and judge phases run with concurrency 1 by default because the local llama.cpp 26B Q5 server is usually started with one slot. It does not use OpenRouter.

bun scripts/autoresearch-memorybench.ts run-canary --execute

If vLLM and llama.cpp cannot be resident at the same time, run the phases separately:

bun scripts/autoresearch-memorybench.ts run-canary --ingest-only --execute
# swap vLLM for llama.cpp
bun scripts/autoresearch-memorybench.ts run-canary --skip-ingest --execute --run-id <same-run-id>

If they are on separate ports, pass --ingest-base-url and --answer-base-url. If the answer server has more than one safe slot, pass --answer-concurrency and --evaluate-concurrency; otherwise leave them at the default.

If a change only affects recall, ranking, context packing, answer prompting, or judging, reuse the warmed canary workspace and skip ingestion:

bun scripts/autoresearch-memorybench.ts run-canary --skip-ingest --execute

If a change affects extraction, the structured remember request shape, graph storage, embeddings, or indexing, invalidate the workspace and ingest again. That is the line between honest fast iteration and fooling ourselves. No need to laminate the pancake.

Progress log

Development run history and score comparisons live in docs/BENCHMARKING-PROGRESS.md. Keep this file focused on how to run benchmarks and what the harness measures.

What is being measured

The default signet provider uses the public Signet daemon HTTP API:

ingest: POST /api/memory/remember
search: POST /api/memory/recall with expand: true
health: GET /health

During ingest, the provider performs MemoryBench-side structured extraction from each session, then calls the full remember endpoint with:

extracted memory content
structured entities
structured aspects and attributes
hint questions
source metadata and per-question scope
the lossless source transcript

The isolated daemon does not run background extraction or synthesis workers for benchmark ingestion. Those stages stay disabled so the benchmark is not racing async background work or depending on local daemon timing. Graph and traversal are enabled only so recall can use the structured data that was explicitly sent to /api/memory/remember; graph.extractionWritesEnabled stays false so the async extractor cannot create benchmark graph structure. Recall treats active structured rows as a first-class candidate source by searching entity names, aspects, group keys, claim keys, attribute kinds, and attribute content before SEC reranking. This is deliberately generic: bridges may connect broad query classes like music, service, brand, or currentness to nearby vocabulary, but the daemon must not contain LongMemEval answer names or dataset-specific product terms. Prospective hint recall stays enabled because those hints are part of the structured remember payload, not a background extraction shortcut. Recall may attach a bounded transcript excerpt to a retrieved memory, or add a low-confidence transcript-only supplemental hit, when expand: true is requested. Those snippets are returned by the recall API and capped so they cannot outrank real memory evidence; the benchmark harness does not append raw transcripts on its own.

The isolated daemon also disables structural backfill/classification. Benchmark graph state must come from the explicit structured remember payload, not from daemon repair jobs that infer entities after the fact.

Structured remembering contract

Structured remembering is not “proper noun extraction.” Proper nouns are an easy case, but the graph contract is broader and simpler:

entity: a durable referent that can be expanded
aspect: a stable facet of that entity
group key: a navigable subgroup inside an aspect
claim key: the specific updateable claim slot within a group
attribute: a sourced claim value attached to that claim slot

For benchmark memories, generic personal facts should not be dropped just because they are not named products or places. They should be attached to a scoped subject entity. For example:

Entity: MemoryBench User <question-scope>
Aspect: music preferences
Attribute: has been listening to Arctic Monkeys and The Neighbourhood on Spotify lately

Aspect: commute routine
Attribute: commute to work takes about 30 minutes

Aspect: morning routine
Attribute: getting ready takes about one hour

Aspect: dining history
Group key: restaurants
Claim key: korean_restaurants_tried_count
Attribute: tried three Korean restaurants recently on 2023-08-11
Attribute: tried four Korean restaurants so far on 2023-09-30

The remember endpoint accepts structured data directly. If an aspect references an entity that was not present in the relation list, the daemon creates that entity and attaches the aspect/attributes in the same transaction. This lets the client be explicit about what is being saved and why, without relying on the background pipeline to invent graph structure.

Temporal and update-sensitive attributes should preserve words like currently, recently, previously, counts, dates, and before/after relationships. The storage schema has active/superseded status fields, so structured benchmark ingestion sends each session source timestamp into the remember endpoint. Supersession requires the same scoped entity, the same aspect, the same group key, and the same claim key. Attributes without a claim key are saved as ordinary evidence and do not automatically supersede sibling attributes. When a newer structured attribute conflicts with an older attribute on the same grouped claim key, the daemon marks the older attribute as superseded and records the replacement attribute id. Recall then dampens memories whose structured facts are only stale and annotates returned context with a [Signet currentness] note so the answer model can prefer the current replacement instead of guessing from two conflicting memories.

MemoryBench still performs the answer and judge phases itself. This keeps the benchmark comparable with the other providers and avoids benchmark-specific changes to MemoryBench scoring logic.

Environment knobs

SIGNET_BENCH_FULL=1                 Run the full benchmark by default.
SIGNET_BENCH_SKIP_BUILD=1           Skip `bun run build`.
SIGNET_BENCH_KEEP_WORKSPACE=1       Keep the isolated workspace after the run.
SIGNET_BENCH_PROFILE=<profile>      rules or supermemory-parity, default rules.
SIGNET_BENCH_RUN_ID=<id>            Override the MemoryBench run id.
SIGNET_BENCH_JUDGE=<model>          Default judge model, default gpt-4o.
SIGNET_BENCH_ANSWERING_MODEL=<m>    Default answering model, default gpt-4o.
SIGNET_BENCH_SAMPLE_PER_TYPE=<n>    Default dev sample size, default 1.
SIGNET_BENCH_EMBEDDING_PROVIDER=<p> Generated daemon embedding provider, default native.
SIGNET_BENCH_EMBEDDING_MODEL=<m>    Generated daemon embedding model.
SIGNET_BENCH_EMBEDDING_DIMENSIONS=<n> Generated daemon embedding dimensions.
SIGNET_BENCH_AGENT_ID=<id>          Signet agent scope, default memorybench.
SIGNET_BENCH_PROJECT=<name>         Signet project scope, default memorybench.
SIGNET_BENCH_REQUEST_TIMEOUT_MS=<n> Daemon request timeout, default 60000.
SIGNET_BENCH_SESSION_CONCURRENCY=<n> Per-question session ingest concurrency, default 1, max 16.
SIGNET_BENCH_INGEST_OPENROUTER=1    Use OpenRouter defaults for bench:ingest.
SIGNET_BENCH_OPENROUTER_MODEL=<m>   OpenRouter extraction model, default inception/mercury-2.
SIGNET_BENCH_OPENROUTER_BASE_URL=<u> OpenRouter-compatible base URL override.
MEMORYBENCH_EXTRACTION_MODEL=<m>    Structured extraction model, default gpt-4o.
MEMORYBENCH_EXTRACTION_MAX_TOKENS=<n> Markdown extraction cap, default 1200.
MEMORYBENCH_STRUCTURED_EXTRACTION_MAX_TOKENS=<n> Structured JSON extraction cap, default 1800.
MEMORYBENCH_SESSION_CONCURRENCY=<n> Per-question session ingest concurrency, default 1, max 16.
OPENROUTER_API_KEY                  Preferred injected env var for OpenRouter ingestion.
OPENAI_BASE_URL                     OpenAI-compatible API base URL.

Reports

MemoryBench writes checkpoints and reports under memorybench/data/runs/. That directory is ignored by Git. Reports should be attached to PRs or release notes only when benchmark numbers are being used to justify a memory-system change.