Knowledge Architecture Schema and Traversal Spec

Status: Approved (v1)

Audience: Core + Daemon maintainers

Spec metadata:

ID: knowledge-architecture-schema
Status: approved
Hard depends on: memory-pipeline-v2, session-continuity-protocol, procedural-memory-plan
Blocks: predictive-memory-scorer
Registry: docs/specs/INDEX.md

Related docs:

docs/KNOWLEDGE-ARCHITECTURE.md (conceptual model)
docs/specs/planning/predictive-memory-scorer.md (learned ranking)
docs/specs/complete/memory-pipeline-plan.md (pipeline contracts)
docs/specs/approved/procedural-memory-plan.md (skills as procedural memory)
docs/specs/approved/session-continuity-protocol.md (checkpoint and recovery)

1) Purpose

KNOWLEDGE-ARCHITECTURE.md defines the conceptual model (entity -> aspect -> attribute/constraint, plus dependency traversal). This spec turns that model into an implementation contract with:

additive schema changes
extraction and backfill contracts
traversal-first retrieval contracts
integration points with predictive scoring and continuity checkpoints

This is the structural floor that predictive ranking should run on.

Local dependency graph:

flowchart LR
  MP[memory-pipeline-v2] --> KA[knowledge-architecture-schema]
  SCP[session-continuity-protocol] --> KA
  PM[procedural-memory-plan] --> KA
  KA --> PMS[predictive-memory-scorer]

2) Scope and Non-Goals

In scope

Entity/aspect/attribute/constraint/task representation in SQLite
Dependency edges as explicit graph structure
Session-start traversal contracts for context injection
Cross-spec contracts with scorer, procedural memory, and continuity

Out of scope (this revision)

Multi-hop planning/reasoning beyond one-hop dependency traversal
Automatic task execution
Autonomous destructive mutations without existing policy gates

3) Baseline (Current State)

Current graph-relevant state already in repo:

entities, relations, and memory_entity_mentions exist
session_memories exists and stores candidate/injection telemetry
session_checkpoints exists and stores continuity digests
Predictor crate and training pipeline exist through Phase 2

Current gap:

The system has entity mentions and relation edges, but no first-class representation for aspects, constraints, or task lifecycle. Retrieval is still primarily search/scoring-first, not traversal-first.

4) Cross-Spec Contract Map

Spec	Produces	Consumes from this spec
`memory-pipeline-plan.md`	extraction + mutation pipeline	structural assignment contract, schema ownership, backfill behavior
`predictive-memory-scorer.md`	ranking model + training loop	traversal candidate pool, structural features (entity/aspect/constraint)
`procedural-memory-plan.md`	skill nodes + procedural decay	shared entity/aspect model (`entity_type='skill'`)
`session-continuity-protocol.md`	checkpoint + recovery	focal entity/aspect snapshot for recovery injection and training context

Normative rule: predictive ranking is an enhancer. Traversal-defined structure is the primary retrieval floor.

5) Data Model (Additive)

5.1 Entity type taxonomy

Extend entities.entity_type usage to the concrete, identity-bearing set:

person
organization
project
product
system
tool
artifact
document
source
place
event

Events are first-class entities when they represent real happenings with time, provenance, participants, a target object, or an event type. Abstract concepts, claim slots, policies, actions, workflows, tasks, and prompt roles are not entity types in the extraction path; they live as aspects, attributes, claims, relations, proposal records, or operational task metadata.
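As a rough illustration, the extraction-path taxonomy above could be pinned down as a literal union so that non-entity labels such as tasks or policies are rejected early. The names EXTRACTION_ENTITY_TYPES and isExtractionEntityType are hypothetical, not existing helpers:

```typescript
// Entity types allowed on the extraction path, copied from the list in 5.1.
const EXTRACTION_ENTITY_TYPES = [
  "person", "organization", "project", "product", "system", "tool",
  "artifact", "document", "source", "place", "event",
] as const;

type ExtractionEntityType = (typeof EXTRACTION_ENTITY_TYPES)[number];

// Hypothetical guard: true only for types in the 5.1 set.
function isExtractionEntityType(value: string): value is ExtractionEntityType {
  return (EXTRACTION_ENTITY_TYPES as readonly string[]).indexOf(value) !== -1;
}
```

A caller on the extraction path could use the guard to route anything else (for example "task" or "policy") to aspects, attributes, or task metadata instead of creating an entity.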

5.2 Backfill: `agent_id` on `entities`

The entities table (migration 002) predates the multi-agent scoping invariant. Add agent_id with a default and index:

ALTER TABLE entities ADD COLUMN agent_id TEXT NOT NULL DEFAULT 'default';
CREATE INDEX idx_entities_agent ON entities(agent_id);

All new KA tables include agent_id for database-level tenant isolation. This is not a KA concern — it is the multi-agent invariant applied uniformly. Queries filter by agent_id unless explicitly requesting cross-agent results.
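SQLite's ALTER TABLE ... ADD COLUMN has no IF NOT EXISTS form, so a re-runnable migration has to check for the column first. A minimal sketch of that check, assuming the caller has already read the column names (for example from PRAGMA table_info(entities)); planAgentIdBackfill is a hypothetical helper, not part of the existing migration code:

```typescript
// Hypothetical planner: returns the 5.2 statements that still need to run.
// `entityColumns` is assumed to hold the column names currently on `entities`.
function planAgentIdBackfill(entityColumns: string[]): string[] {
  const statements: string[] = [];
  if (entityColumns.indexOf("agent_id") === -1) {
    // ADD COLUMN fails if the column already exists, so guard it explicitly.
    statements.push(
      "ALTER TABLE entities ADD COLUMN agent_id TEXT NOT NULL DEFAULT 'default'"
    );
  }
  // CREATE INDEX supports IF NOT EXISTS, so it is always safe to emit.
  statements.push(
    "CREATE INDEX IF NOT EXISTS idx_entities_agent ON entities(agent_id)"
  );
  return statements;
}
```

On a fresh database this yields both statements; on a database where the column already exists it yields only the index statement.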

5.3 New table: `entity_aspects`

CREATE TABLE entity_aspects (
  id TEXT PRIMARY KEY,
  entity_id TEXT NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
  agent_id TEXT NOT NULL DEFAULT 'default',
  name TEXT NOT NULL,
  canonical_name TEXT NOT NULL,
  weight REAL NOT NULL DEFAULT 0.5,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  UNIQUE(entity_id, canonical_name)
);

CREATE INDEX idx_entity_aspects_entity ON entity_aspects(entity_id);
CREATE INDEX idx_entity_aspects_agent ON entity_aspects(agent_id);
CREATE INDEX idx_entity_aspects_weight ON entity_aspects(weight DESC);

weight combines structural centrality with learned utility. It is not a raw frequency count.

5.4 New table: `entity_attributes`

CREATE TABLE entity_attributes (
  id TEXT PRIMARY KEY,
  aspect_id TEXT REFERENCES entity_aspects(id) ON DELETE CASCADE,  -- NULL until Pass 2a assigns an aspect (see 6.1, 6.4)
  agent_id TEXT NOT NULL DEFAULT 'default',
  memory_id TEXT REFERENCES memories(id) ON DELETE SET NULL,
  kind TEXT NOT NULL,                 -- 'attribute' | 'constraint'
  content TEXT NOT NULL,
  normalized_content TEXT NOT NULL,
  confidence REAL NOT NULL DEFAULT 0.0,
  importance REAL NOT NULL DEFAULT 0.5,
  status TEXT NOT NULL DEFAULT 'active',  -- 'active' | 'superseded' | 'deleted'
  superseded_by TEXT,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL
);

CREATE INDEX idx_entity_attributes_aspect ON entity_attributes(aspect_id);
CREATE INDEX idx_entity_attributes_agent ON entity_attributes(agent_id);
CREATE INDEX idx_entity_attributes_kind ON entity_attributes(kind);
CREATE INDEX idx_entity_attributes_status ON entity_attributes(status);

Constraints are first-class rows (kind='constraint'), not inferred tags.
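The status and superseded_by columns imply that replacing an attribute keeps the old row in place. A minimal in-memory sketch of that transition, using a subset of the columns above; supersedeAttribute is a hypothetical helper and the real write path may differ:

```typescript
type AttributeKind = "attribute" | "constraint";
type AttributeStatus = "active" | "superseded" | "deleted";

// Subset of the entity_attributes columns, for illustration only.
interface EntityAttributeRow {
  id: string;
  aspect_id: string | null; // null while awaiting classification (see 6.1)
  memory_id: string | null;
  kind: AttributeKind;
  content: string;
  status: AttributeStatus;
  superseded_by: string | null;
}

// Marks `oldId` as superseded and appends the replacement as the active row.
// The old row is kept, so the history stays auditable.
function supersedeAttribute(
  rows: EntityAttributeRow[],
  oldId: string,
  replacement: EntityAttributeRow
): EntityAttributeRow[] {
  const updated = rows.map((row) =>
    row.id === oldId
      ? { ...row, status: "superseded" as const, superseded_by: replacement.id }
      : row
  );
  return [...updated, { ...replacement, status: "active" as const, superseded_by: null }];
}
```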

5.5 New table: `entity_dependencies`

CREATE TABLE entity_dependencies (
  id TEXT PRIMARY KEY,
  source_entity_id TEXT NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
  target_entity_id TEXT NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
  agent_id TEXT NOT NULL DEFAULT 'default',
  aspect_id TEXT REFERENCES entity_aspects(id) ON DELETE SET NULL,
  dependency_type TEXT NOT NULL,      -- 'uses' | 'requires' | 'owned_by' | 'blocks' | 'informs'
  strength REAL NOT NULL DEFAULT 0.5,
  confidence REAL NOT NULL DEFAULT 0.7,
  reason TEXT,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL
);

CREATE INDEX idx_entity_dependencies_source ON entity_dependencies(source_entity_id);
CREATE INDEX idx_entity_dependencies_target ON entity_dependencies(target_entity_id);
CREATE INDEX idx_entity_dependencies_agent ON entity_dependencies(agent_id);

These are explicit traversal edges. They are not similarity artifacts. The related_to dependency type is a special case: it MUST carry a non-empty, human-auditable reason. Separately, every create/update/delete of any dependency edge MUST append an immutable row to entity_dependency_history.
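The reason requirement for related_to edges can be checked before a write reaches the database. A minimal sketch, assuming validation happens in application code ahead of the insert; validateDependencyEdge and its input shape are hypothetical:

```typescript
// Illustrative input shape; mirrors a few entity_dependencies columns.
interface DependencyEdgeInput {
  source_entity_id: string;
  target_entity_id: string;
  dependency_type: string;
  reason?: string | null;
}

// Returns a list of problems; an empty list means the edge may be written.
function validateDependencyEdge(edge: DependencyEdgeInput): string[] {
  const problems: string[] = [];
  const reason = (edge.reason ?? "").trim();
  if (edge.dependency_type === "related_to" && reason.length === 0) {
    problems.push("related_to edges must carry a non-empty reason");
  }
  return problems;
}
```

Other dependency types pass through without a reason here, since the schema leaves the column nullable for them.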

5.5.1 Audit table: `entity_dependency_history`

Every mutation to entity_dependencies is captured by DB-level SQLite triggers (AFTER INSERT, AFTER UPDATE, AFTER DELETE). Application code MUST NOT write audit rows directly; the triggers own that path to guarantee complete coverage including FK cascades and direct SQL.

CREATE TABLE entity_dependency_history (
  id                TEXT PRIMARY KEY,
  dependency_id     TEXT NOT NULL,
  source_entity_id  TEXT NOT NULL,
  target_entity_id  TEXT NOT NULL,
  agent_id          TEXT NOT NULL DEFAULT 'default',
  dependency_type   TEXT NOT NULL,
  event             TEXT NOT NULL,   -- 'created' | 'updated' | 'deleted' | 'backfill'
  changed_by        TEXT NOT NULL,   -- 'db-trigger' for all live events
  reason            TEXT NOT NULL,
  previous_reason   TEXT,
  metadata          TEXT,
  created_at        TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE INDEX idx_entity_dependency_history_dep     ON entity_dependency_history(dependency_id);
CREATE INDEX idx_entity_dependency_history_agent   ON entity_dependency_history(agent_id);
CREATE INDEX idx_entity_dependency_history_created ON entity_dependency_history(created_at DESC);

Integration contracts:

changed_by is 'db-trigger' for all live mutations and 'migration-N' for backfill rows.
reason is never nullable — backfill rows use 'legacy dependency without recorded reason' when the source row has no reason.
previous_reason captures the prior value on updates; NULL on creates and deletes.
created_at uses SQLite datetime('now') format (YYYY-MM-DD HH:MM:SS) throughout — no mixed ISO-8601 timestamps.
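The backfill-related contracts above can be sketched as a small row builder. This is only an illustration: the row type covers a subset of the history columns, treating a blank reason the same as a missing one is an assumption, and leaving previous_reason NULL on backfill rows is also an assumption (the contract only specifies creates and deletes). The helper names are hypothetical:

```typescript
// Subset of entity_dependency_history columns, for illustration only.
interface DependencyHistorySketch {
  dependency_id: string;
  event: "created" | "updated" | "deleted" | "backfill";
  changed_by: string;
  reason: string;
  previous_reason: string | null;
  created_at: string;
}

// Formats a Date as UTC 'YYYY-MM-DD HH:MM:SS', matching datetime('now').
function toSqliteDatetime(date: Date): string {
  return date.toISOString().slice(0, 19).replace("T", " ");
}

function buildBackfillHistoryRow(
  dep: { id: string; reason: string | null },
  migrationNumber: number,
  now: Date
): DependencyHistorySketch {
  const hasReason = dep.reason !== null && dep.reason.trim() !== "";
  return {
    dependency_id: dep.id,
    event: "backfill",
    changed_by: `migration-${migrationNumber}`,
    // Fallback text is taken verbatim from the integration contracts.
    reason: hasReason ? (dep.reason as string) : "legacy dependency without recorded reason",
    previous_reason: null,
    created_at: toSqliteDatetime(now),
  };
}
```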

5.6 New table: `task_meta`

CREATE TABLE task_meta (
  entity_id TEXT PRIMARY KEY REFERENCES entities(id) ON DELETE CASCADE,
  agent_id TEXT NOT NULL DEFAULT 'default',
  status TEXT NOT NULL,                -- 'open' | 'in_progress' | 'blocked' | 'done' | 'cancelled'
  expires_at TEXT,
  retention_until TEXT,
  completed_at TEXT,
  updated_at TEXT NOT NULL
);

CREATE INDEX idx_task_meta_agent ON task_meta(agent_id);
CREATE INDEX idx_task_meta_status ON task_meta(status);
CREATE INDEX idx_task_meta_retention ON task_meta(retention_until);

Tasks share entity structure but use separate lifecycle rules.

6) Structural Assignment (Two-Pass Architecture)

Structural assignment uses a two-pass architecture to balance speed and accuracy. Pass 1 runs synchronously on the hot path with no LLM call. Pass 2 runs in the background as separate pipeline jobs.

6.1 Pass 1: Heuristic entity linking (synchronous, no LLM)

After fact extraction and entity persistence, the pipeline links each written fact memory to its primary entity:

Resolve primary entity from the extraction triple’s source field (already persisted by txPersistEntities)
Create a stub entity_attributes row with aspect_id = NULL and kind = 'attribute' (default)
Enqueue two background jobs: structural_classify and structural_dependency

Pass 1 does NOT attempt aspect classification or constraint detection. It only establishes the fact → entity link. This is cheap and reliable — the extraction already identifies entities.
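The three Pass 1 steps can be sketched as a pure function that produces the stub row and the two jobs to enqueue. All type and function names here are illustrative, and the sketch assumes the triple's source field has already been resolved to an entity id by txPersistEntities:

```typescript
// Illustrative shapes; the real pipeline types may differ.
interface PassOneFact {
  memoryId: string;
  content: string;
  sourceEntityId: string; // already resolved from the triple's source field
}

interface StubAttributeRow {
  memory_id: string;
  aspect_id: null; // filled in later by structural_classify
  kind: "attribute"; // default until Pass 2a decides otherwise
  content: string;
}

interface StructuralJob {
  type: "structural_classify" | "structural_dependency";
  memoryId: string;
  entityId: string;
}

// No LLM call and no I/O: cheap enough for the synchronous hot path.
function runPassOne(fact: PassOneFact): { stub: StubAttributeRow; jobs: StructuralJob[] } {
  const stub: StubAttributeRow = {
    memory_id: fact.memoryId,
    aspect_id: null,
    kind: "attribute",
    content: fact.content,
  };
  const jobs: StructuralJob[] = [
    { type: "structural_classify", memoryId: fact.memoryId, entityId: fact.sourceEntityId },
    { type: "structural_dependency", memoryId: fact.memoryId, entityId: fact.sourceEntityId },
  ];
  return { stub, jobs };
}
```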

6.2 Pass 2a: Structural classification (background, LLM)

A dedicated LLM prompt classifies each unassigned fact into an aspect and determines whether it is an attribute or constraint.

Input per batch (up to 10 facts; see 6.6):

The parent entity (name, type, existing aspects)
Suggested aspect patterns for the entity type
The fact content

Output per fact:

aspect — existing or new aspect name
kind — 'attribute' or 'constraint'
new — whether this creates a new aspect

Job type: structural_classify. Same lease/retry/dead-letter mechanics as extraction jobs. Batched by entity to provide aspect context.
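Batching by entity with a size cap could look roughly like the following. The default of 10 reflects the upper end of the range tested in 6.6; the function and type names are hypothetical:

```typescript
interface UnclassifiedFact {
  memoryId: string;
  entityId: string;
  content: string;
}

// Groups facts by parent entity, then splits each group into capped batches
// so every batch shares one entity's aspect context.
function batchFactsByEntity(
  facts: UnclassifiedFact[],
  maxBatchSize = 10
): UnclassifiedFact[][] {
  if (maxBatchSize < 1) throw new Error("maxBatchSize must be at least 1");
  const byEntity = new Map<string, UnclassifiedFact[]>();
  for (const fact of facts) {
    const group = byEntity.get(fact.entityId) ?? [];
    group.push(fact);
    byEntity.set(fact.entityId, group);
  }
  const batches: UnclassifiedFact[][] = [];
  byEntity.forEach((group) => {
    for (let i = 0; i < group.length; i += maxBatchSize) {
      batches.push(group.slice(i, i + maxBatchSize));
    }
  });
  return batches;
}
```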

6.3 Pass 2b: Dependency extraction (background, LLM)

A separate LLM prompt identifies structural dependencies between entities implied by fact content.

Input per batch (max 5 facts):

The source entity
The fact content
Known entities in the graph (for target resolution)

Output per fact:

dep_target — target entity name (or null)
dep_type — 'uses' | 'requires' | 'owned_by' | 'blocks' | 'informs' (or null)

Pre-filter: only facts whose extraction triples reference other entities are sent to this pass. Purely self-referential facts skip it entirely.

Job type: structural_dependency. Independent queue from classification.
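The pre-filter and the five-fact batch limit can be sketched together. This assumes each fact carries the entity ids named in its extraction triples; the field and function names are illustrative:

```typescript
interface DependencyFactInput {
  memoryId: string;
  sourceEntityId: string;
  referencedEntityIds: string[]; // entity ids named in the fact's triples
}

// Keeps only facts that mention an entity other than their own source,
// then splits the survivors into batches of at most `maxBatchSize`.
function selectDependencyBatches(
  facts: DependencyFactInput[],
  maxBatchSize = 5
): DependencyFactInput[][] {
  if (maxBatchSize < 1) throw new Error("maxBatchSize must be at least 1");
  const eligible = facts.filter((fact) =>
    fact.referencedEntityIds.some((id) => id !== fact.sourceEntityId)
  );
  const batches: DependencyFactInput[][] = [];
  for (let i = 0; i < eligible.length; i += maxBatchSize) {
    batches.push(eligible.slice(i, i + maxBatchSize));
  }
  return batches;
}
```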

6.4 Assignment invariants

Every active atomic fact memory should map to exactly one primary entity_attributes row.
Constraints always map to kind='constraint'.
Dependency edges are additive and idempotent.
Superseded attributes remain auditable; their rows are kept, not removed.
Pass 2a and 2b are isolated — errors in one do not affect the other.
Facts with aspect_id = NULL are valid (awaiting classification).
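The "additive and idempotent" invariant for dependency edges can be illustrated with a keyed insert. The choice of key (agent, source, target, type) is an assumption, since the schema in 5.5 declares no UNIQUE constraint; the names are hypothetical:

```typescript
interface DependencyEdgeKeyParts {
  agent_id: string;
  source_entity_id: string;
  target_entity_id: string;
  dependency_type: string;
}

// Assumed identity of an edge; JSON encoding avoids delimiter collisions.
function dependencyEdgeKey(edge: DependencyEdgeKeyParts): string {
  return JSON.stringify([
    edge.agent_id, edge.source_entity_id, edge.target_entity_id, edge.dependency_type,
  ]);
}

// Adds the edge only if no edge with the same key exists; never removes edges.
function addDependencyIdempotent(
  existing: DependencyEdgeKeyParts[],
  incoming: DependencyEdgeKeyParts
): DependencyEdgeKeyParts[] {
  const key = dependencyEdgeKey(incoming);
  const alreadyPresent = existing.some((edge) => dependencyEdgeKey(edge) === key);
  return alreadyPresent ? existing : [...existing, incoming];
}
```

Replaying the same Pass 2b job twice then leaves the edge set unchanged, which is what makes retries safe.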

6.5 Backfill behavior

Maintenance worker backfills unassigned legacy memories incrementally:

scan memories with no entity_attributes row in batches
run pass 1 (entity linking) then enqueue pass 2 jobs
skip low-confidence rows and record telemetry
never block foreground hooks
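One incremental backfill step might look like the sketch below. The confidence threshold is a caller-supplied assumption (the spec does not fix a value), and the names are hypothetical. The function only plans the work; linking and job enqueueing would happen in the maintenance worker, off the foreground path:

```typescript
interface LegacyMemoryRow {
  id: string;
  confidence: number;
}

interface BackfillTelemetry {
  scanned: number;
  linked: number;
  skippedLowConfidence: number;
}

// Scans up to `batchSize` unassigned memories and decides which to link.
// Note: skipped rows stay unassigned, so a real worker would also need a
// cursor or marker to avoid rescanning them on every run.
function planBackfillBatch(
  memories: LegacyMemoryRow[],
  assignedIds: Set<string>,
  batchSize: number,
  minConfidence: number
): { toLink: string[]; telemetry: BackfillTelemetry } {
  const telemetry: BackfillTelemetry = { scanned: 0, linked: 0, skippedLowConfidence: 0 };
  const toLink: string[] = [];
  for (const memory of memories) {
    if (telemetry.scanned >= batchSize) break;
    if (assignedIds.has(memory.id)) continue; // already has an entity_attributes row
    telemetry.scanned += 1;
    if (memory.confidence < minConfidence) {
      telemetry.skippedLowConfidence += 1;
      continue;
    }
    toLink.push(memory.id);
    telemetry.linked += 1;
  }
  return { toLink, telemetry };
}
```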

6.6 Model constraints (tested against qwen3:4b)

Classification prompt handles 8-10 facts per batch reliably
Dependency prompt handles 5 facts per batch reliably
Beyond these limits, the model drops facts or loses format discipline
Prompt must use short JSON field names and minimal boilerplate
/no_think flag suppresses chain-of-thought for structured output
temperature: 0.1 for deterministic classification
Prompt specifications are documented in the KA-2 sprint brief
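Because the model can drop facts or lose format discipline, the consumer of a classification batch likely needs a strict parse step. A minimal sketch, assuming the model returns a JSON array with one object per input fact and the field names from 6.2 (aspect, kind, new); the actual prompt format lives in the KA-2 sprint brief and may differ:

```typescript
interface ClassificationResult {
  aspect: string;
  kind: "attribute" | "constraint";
  new: boolean;
}

// Returns null when the batch is unusable, so the job can be retried.
function parseClassificationBatch(
  raw: string,
  expectedCount: number
): ClassificationResult[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON at all
  }
  // A length mismatch means the model dropped or invented facts.
  if (!Array.isArray(parsed) || parsed.length !== expectedCount) return null;
  const results: ClassificationResult[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) return null;
    const record = item as Record<string, unknown>;
    const aspect = record["aspect"];
    const isNew = record["new"];
    const kind =
      record["kind"] === "constraint" ? "constraint"
      : record["kind"] === "attribute" ? "attribute"
      : null;
    if (typeof aspect !== "string" || aspect.trim() === "") return null;
    if (kind === null || typeof isNew !== "boolean") return null;
    results.push({ aspect, kind, new: isNew });
  }
  return results;
}
```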

7) Retrieval Contract (Traversal First)

7.1 Session-start context assembly

Order of operations:

Resolve focal entities from session signals (project path, checkpoint, session key lineage, prompt hints)
Pull all active constraints for focal entities and one-hop dependencies
Pull top aspects by weight for each focal entity
Pull active attributes under those aspects
Materialize candidate memory IDs via entity_attributes.memory_id

This produces a structurally coherent candidate pool before heuristic or model ranking runs.
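The assembly order above can be sketched over in-memory rows. Several details are assumptions made for illustration: one-hop dependencies are read as outgoing edges from focal entities, the per-entity aspect limit defaults to 3, rows with a NULL aspect_id are skipped because they cannot yet be tied to an entity, and focal entities are passed in already resolved. The function name is hypothetical:

```typescript
interface AspectRow { id: string; entity_id: string; weight: number; }
interface AttributeRow {
  aspect_id: string | null;
  memory_id: string | null;
  kind: "attribute" | "constraint";
  status: string;
}
interface DependencyRow { source_entity_id: string; target_entity_id: string; }

function assembleTraversalPool(
  focal: string[],
  aspects: AspectRow[],
  attributes: AttributeRow[],
  dependencies: DependencyRow[],
  topAspects = 3
): string[] {
  // Scope = focal entities plus their one-hop dependency targets.
  const inScope = new Set<string>(focal);
  for (const dep of dependencies) {
    if (focal.indexOf(dep.source_entity_id) !== -1) inScope.add(dep.target_entity_id);
  }
  const aspectEntity = new Map<string, string>();
  for (const aspect of aspects) aspectEntity.set(aspect.id, aspect.entity_id);

  const active = attributes.filter((row) => row.status === "active");
  // All active constraints for in-scope entities.
  const constraints = active.filter((row) => {
    if (row.kind !== "constraint" || row.aspect_id === null) return false;
    const entityId = aspectEntity.get(row.aspect_id);
    return entityId !== undefined && inScope.has(entityId);
  });
  // Top aspects by weight for each focal entity.
  const chosen = new Set<string>();
  for (const entityId of focal) {
    aspects
      .filter((aspect) => aspect.entity_id === entityId)
      .sort((a, b) => b.weight - a.weight)
      .slice(0, topAspects)
      .forEach((aspect) => chosen.add(aspect.id));
  }
  // Active attributes under those aspects.
  const underAspects = active.filter(
    (row) => row.aspect_id !== null && chosen.has(row.aspect_id)
  );
  // Materialize memory ids, constraints first, without duplicates.
  const memoryIds: string[] = [];
  for (const row of constraints.concat(underAspects)) {
    if (row.memory_id !== null && memoryIds.indexOf(row.memory_id) === -1) {
      memoryIds.push(row.memory_id);
    }
  }
  return memoryIds;
}
```

Listing constraints first keeps them at the front of the pool, which fits the invariant that they are surfaced regardless of score.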

7.2 Candidate pool fusion with predictor pre-filter

Predictor pre-filter contract changes from:

effective top-50 ∪ embedding top-50

to:

traversal pool ∪ effective top-50 ∪ embedding top-50

Then dedupe and cap (configurable, default 100).
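The fused pre-filter is a set union followed by a cap. A minimal sketch over memory ids, assuming traversal candidates are listed first so that the cap trims the other sources before it trims the structural pool (that ordering is an assumption, as is the function name):

```typescript
// Unions the three sources, drops duplicates, and caps the result.
function fuseCandidatePools(
  traversal: string[],
  effectiveTop: string[],
  embeddingTop: string[],
  cap = 100
): string[] {
  const seen = new Set<string>();
  const fused: string[] = [];
  for (const id of traversal.concat(effectiveTop, embeddingTop)) {
    if (seen.has(id)) continue;
    seen.add(id);
    fused.push(id);
  }
  return fused.slice(0, cap);
}
```

For example, fusing ["a", "b"], ["b", "c"], and ["c", "d"] with a cap of 3 keeps "a", "b", and "c" in that order.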

7.3 Hard retrieval invariant

Constraints are always surfaced when their entity is in scope, independent of score rank.
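One way to honor this invariant is to pin constraints before applying any score-based cut. The sketch below assumes that constraints are kept even when they alone exceed the limit; how the real injection budget handles that case is not specified here. Names are hypothetical:

```typescript
interface RankedCandidate {
  memoryId: string;
  isConstraint: boolean;
  score: number;
}

// Constraints are always kept; remaining slots go to the best-scored others.
function selectForInjection(
  candidates: RankedCandidate[],
  limit: number
): RankedCandidate[] {
  const constraints = candidates.filter((c) => c.isConstraint);
  const others = candidates
    .filter((c) => !c.isConstraint)
    .sort((a, b) => b.score - a.score);
  const remaining = Math.max(0, limit - constraints.length);
  return constraints.concat(others.slice(0, remaining));
}
```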

8) Predictive Scorer Integration

predictive-memory-scorer.md consumes this spec in three places:

Candidate quality: scorer receives structurally coherent candidates
Feature enrichment: add structural features per candidate
- entity slot hash
- aspect slot hash
- is_constraint
Evaluation slices: report win/loss by focal entity/project, not only global EMA

The predictor still earns influence via comparisons. This spec improves its input quality and interpretability.
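The three structural features could be derived per candidate roughly as follows. The hash function (FNV-1a over UTF-16 code units), the slot count of 1024, and the -1 sentinel for unclassified facts are all assumptions for illustration; the scorer spec may define these differently:

```typescript
// 32-bit FNV-1a over UTF-16 code units; returns an unsigned integer.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

interface StructuralFeatures {
  entitySlot: number;
  aspectSlot: number; // -1 when the fact has no aspect yet
  isConstraint: 0 | 1;
}

function structuralFeatures(
  entityId: string,
  aspectId: string | null,
  kind: "attribute" | "constraint",
  slots = 1024
): StructuralFeatures {
  return {
    entitySlot: fnv1a32(entityId) % slots,
    aspectSlot: aspectId === null ? -1 : fnv1a32(aspectId) % slots,
    isConstraint: kind === "constraint" ? 1 : 0,
  };
}
```

Hashing into a fixed number of slots keeps the feature width constant as the graph grows, at the cost of occasional collisions.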

9) Procedural Memory Integration

procedural-memory-plan.md remains authoritative for skill lifecycle.

Alignment rules:

Skills remain entity_type='skill'
Skill metadata (skill_meta) remains source-of-truth for runtime skill behavior
Skill knowledge can also map into entity_aspects / entity_attributes for unified traversal and scoring

This keeps one graph with type-specific lifecycle rules.

10) Continuity Protocol Integration

session-continuity-protocol.md integration points:

checkpoint digests add optional structural snapshot fields:
- focal entities
- active aspects
- surfaced constraints
recovery injection should prioritize these structural snapshots over raw narrative when budget is tight
predictor label quality improves when session-end evaluation knows which constraints and aspects were in play
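A rough sketch of how recovery injection might prefer the structural snapshot under a tight budget. The digest shape, the character-based budget, and the output layout are all assumptions for illustration; session-continuity-protocol.md remains the authority on the actual checkpoint format:

```typescript
// Hypothetical digest shape with the optional structural fields listed above.
interface CheckpointDigestSketch {
  narrative: string;
  focalEntities?: string[];
  activeAspects?: string[];
  surfacedConstraints?: string[];
}

// Emits the structural snapshot first; narrative only fills leftover budget.
function buildRecoveryInjection(
  digest: CheckpointDigestSketch,
  budgetChars: number
): string {
  const sections: Array<[string, string[]]> = [
    ["focal", digest.focalEntities ?? []],
    ["aspects", digest.activeAspects ?? []],
    ["constraints", digest.surfacedConstraints ?? []],
  ];
  const structural = sections
    .filter(([, values]) => values.length > 0)
    .map(([label, values]) => `${label}: ${values.join(", ")}`)
    .join("\n");
  // Older checkpoints without structural fields fall back to narrative only.
  if (structural.length === 0) return digest.narrative.slice(0, budgetChars);
  const room = budgetChars - structural.length - 1; // 1 for the separator
  return room > 0
    ? `${structural}\n${digest.narrative.slice(0, room)}`
    : structural.slice(0, budgetChars);
}
```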

11) Migration and Phase Plan

KA-1 Schema and types

Add migration 019-knowledge-structure.ts:
- Backfill agent_id on entities table
- Create entity_aspects, entity_attributes, entity_dependencies, task_meta — all with agent_id column
Add core types and read/write helpers

KA-2 Structural assignment in pipeline

Add assignment stage in summary/extraction path
Persist mappings for newly extracted atomic facts
Add telemetry for assignment confidence and coverage

KA-3 Traversal retrieval path

Add traversal query builder in daemon
Wire session-start and recall flows to include traversal candidates
Enforce constraint surfacing invariant

KA-4 Predictor coupling

Extend predictor request payload with structural features
Update comparison/audit APIs with structural slices

KA-5 Continuity + dashboard

Store structural checkpoint slices
Surface entity/aspect/constraint context in dashboard timeline and predictor inspector

12) Acceptance Criteria

≥90% of active atomic fact memories have structural assignment (entity + aspect + attribute/constraint)
Session-start context includes constraint rows for in-scope entities with zero omissions in test fixtures
Traversal candidate pool remains bounded and deterministic
Predictor comparison reports include structural slices (entity/project)
Recovery injections include structural snapshot fields when available
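The 90% structural-assignment criterion can be checked with a simple coverage ratio. A minimal sketch, assuming each active fact is summarized by its resolved entity, aspect, and kind, and that an empty corpus passes vacuously (an assumption); names are hypothetical:

```typescript
interface FactAssignment {
  entityId: string | null;
  aspectId: string | null;
  kind: "attribute" | "constraint" | null;
}

// Fraction of facts with a complete entity + aspect + kind assignment.
function structuralCoverage(facts: FactAssignment[]): number {
  if (facts.length === 0) return 1; // assumption: nothing to assign counts as covered
  const assigned = facts.filter(
    (fact) => fact.entityId !== null && fact.aspectId !== null && fact.kind !== null
  ).length;
  return assigned / facts.length;
}

function meetsCoverageTarget(facts: FactAssignment[], target = 0.9): boolean {
  return structuralCoverage(facts) >= target;
}
```

Facts still waiting on Pass 2a (NULL aspect) count against coverage here, so the ratio also works as a backlog signal.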

13) Open Questions

Should aspects be free-form with canonicalization, or backed by a small taxonomy per entity type?
Should task retention default to fixed duration or confidence-driven decay?
Do we need a dedicated constraints table later for policy-level joins, or is entity_attributes(kind='constraint') sufficient?

14) Immediate Next Steps

Approve this spec as the implementation contract for structural retrieval.
Update predictive scorer Phase 3 tasks to include traversal pool fusion.
Draft migration 019-knowledge-structure.ts with exact indexes and idempotency behavior.
Add a small offline benchmark set comparing traversal-first candidate generation vs current heuristic pre-filter.