Document Ingest
The document ingest subsystem lets you push arbitrary text content or URLs
into Signet’s Memory store. Documents are split into overlapping chunks,
each chunk is embedded and written as a document_chunk memory, and the
whole set is linked back to the originating document record. Once indexed,
chunks participate in hybrid search alongside any other memory type.
This doc covers the ingest API, the processing lifecycle, chunking mechanics, and worker configuration. For full endpoint signatures and auth details, see the API reference.
Source Types
Every document submission declares a source_type:
- text: raw string content supplied directly in the request body.
- url: a remote URL the daemon fetches and extracts text from.
- file: a local or remote file path (the url field is used for the path reference).
For url sources, content is fetched at processing time by the worker,
not at submission time. The title field is also populated from the page
<title> if not supplied in the request.
API Endpoints
POST /api/documents
Submits a document for ingestion. The document is written to the database
with status queued and a background job is enqueued immediately.
Request body:
{
"source_type": "text",
"title": "Optional title",
"content": "The full document text.",
"content_type": "text/plain",
"metadata": { "arbitrary": "key-value pairs" }
}
For url source type, replace content with url:
{
"source_type": "url",
"url": "https://example.com/article",
"title": "Optional override title"
}
Response on success (201 Created):
{ "id": "<uuid>", "status": "queued" }
If the same URL is already tracked in a non-failed, non-deleted state, the endpoint returns the existing record instead of creating a duplicate:
{ "id": "<existing-uuid>", "status": "done", "deduplicated": true }
GET /api/documents
Lists all documents, ordered by creation date descending.
Query parameters:
- status: filter by lifecycle status (e.g. ?status=done).
- limit: page size; default 50, maximum 500.
- offset: pagination offset; default 0.
Response:
{
"documents": [ /* document rows */ ],
"total": 42,
"limit": 50,
"offset": 0
}
GET /api/documents/:id
Returns the full document row for the given ID, including status, chunk count, memory count, error message if any, and raw metadata.
Returns 404 if the ID is not found.
GET /api/documents/:id/chunks
Returns all memory entries linked to a document, ordered by chunk index.
Only returns chunks where is_deleted = 0.
Response:
{
"chunks": [
{
"id": "<memory-uuid>",
"content": "chunk text...",
"type": "document_chunk",
"created_at": "2025-01-01T00:00:00.000Z",
"chunk_index": 0
}
],
"count": 7
}
DELETE /api/documents/:id
Soft-deletes the document and all linked memories. Requires a reason
query parameter (?reason=outdated). Each linked memory gets its
is_deleted flag set to 1 and a deletion event is appended to
memory_history. The document row is set to status deleted.
Response:
{ "deleted": true, "memoriesRemoved": 7 }
Status Lifecycle
A document moves through these statuses during processing:
queued → extracting → chunking → embedding → indexing → done
At any stage the worker may transition the document to failed if an
unrecoverable error occurs. Deleted documents land in deleted status
after a DELETE request.
| Status | Meaning |
|---|---|
| queued | Job enqueued, worker has not picked it up yet. |
| extracting | Fetching remote content (URL sources only). |
| chunking | Splitting content into chunks. |
| embedding | Requesting embedding vectors for each chunk. |
| indexing | Final writes: memory records and FTS index update. |
| done | All chunks embedded and indexed. |
| failed | Processing error. See the error field on the document. |
| deleted | Document removed via the DELETE endpoint. |
Status transitions are written inside write transactions so reads always see a consistent state.
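Because ingestion is asynchronous, a client typically polls GET /api/documents/:id until the status settles, then reads the chunks. A hedged sketch under the same base-URL assumption as the submission example:

```typescript
// Sketch: poll a document until it reaches a terminal status, then fetch its chunks.
// BASE_URL and the polling interval are illustrative assumptions.
const BASE_URL = "http://localhost:3000";

async function waitForDocument(id: string, pollMs = 5000): Promise<void> {
  for (;;) {
    const res = await fetch(`${BASE_URL}/api/documents/${id}`);
    const doc = (await res.json()) as { status: string; error?: string };
    if (doc.status === "done") return;
    if (doc.status === "failed") throw new Error(`ingest failed: ${doc.error ?? "unknown error"}`);
    // Otherwise still queued / extracting / chunking / embedding / indexing.
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}

async function fetchChunks(id: string) {
  const res = await fetch(`${BASE_URL}/api/documents/${id}/chunks`);
  return (await res.json()) as {
    chunks: { id: string; content: string; chunk_index: number }[];
    count: number;
  };
}
```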
Chunking Strategy
Text is split into fixed-size windows with a configurable overlap. The overlap ensures that context spanning a chunk boundary is not silently dropped at retrieval time.
The splitter is character-based, not token-based. Given chunkSize = 2000
and chunkOverlap = 200:
- Chunk 0 covers characters 0..2000.
- Chunk 1 covers characters 1800..3800.
- Chunk 2 covers characters 3600..5600.
- And so on until the end of the document.
Documents shorter than chunkSize are stored as a single chunk. Empty
chunks (whitespace only) are skipped and do not produce memory entries.
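A character-based splitter matching the windows above is only a few lines. This is an illustrative sketch of the documented behaviour (windows of chunkSize advancing by chunkSize - chunkOverlap, whitespace-only chunks dropped), not the daemon's actual implementation:

```typescript
// Sketch of the documented chunking behaviour: fixed character windows
// advancing by (chunkSize - chunkOverlap), skipping whitespace-only chunks.
function chunkText(text: string, chunkSize = 2000, chunkOverlap = 200): string[] {
  const stride = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    const chunk = text.slice(start, start + chunkSize);
    if (chunk.trim().length > 0) chunks.push(chunk); // whitespace-only chunks produce no memory
    if (start + chunkSize >= text.length) break;     // last window reached the end of the document
  }
  return chunks;
}

// chunkText("a".repeat(5000)) yields windows covering 0..2000, 1800..3800, 3600..5000.
```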
Default values:
| Parameter | Default | Min | Max |
|---|---|---|---|
| documentChunkSize | 2000 | 200 | 50000 |
| documentChunkOverlap | 200 | 0 | 10000 |
| documentMaxContentBytes | 10 MB | 1 KB | 100 MB |
See the Configuration section below for how to override these.
Document-to-Memory Linking
Each chunk that passes the dedup check becomes an entry in the memories
table with type = "document_chunk". Two join tables connect the pieces:
- document_memories: links a document_id to a memory_id with the chunk's sequential chunk_index. This is the primary join for retrieval by document.
- embeddings: stores the vector blob for each chunk, keyed on content_hash. If two documents share identical chunk text they will share the same embedding row.
Deduplication happens at the chunk level: before writing a new memory the worker checks whether a non-deleted memory with the same content hash is already linked to the document. Matching chunks are silently skipped.
Tags on chunk memories take the form document:<title>, making it easy to
filter hybrid search results by source document.
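To make the dedup rule concrete, here is a rough sketch of the per-chunk check. The table names follow this section, but the hash algorithm, the content_hash column on memories, and the exact SQL are assumptions rather than the daemon's real queries:

```typescript
import { createHash } from "node:crypto";
import Database from "better-sqlite3";

// Sketch of the chunk-level dedup check described above. The sha256 hash and
// the memories.content_hash column are assumptions for illustration only.
function isDuplicateChunk(db: Database.Database, documentId: string, chunkText: string): boolean {
  const contentHash = createHash("sha256").update(chunkText).digest("hex");
  const row = db
    .prepare(
      `SELECT m.id
         FROM memories m
         JOIN document_memories dm ON dm.memory_id = m.id
        WHERE dm.document_id = ?
          AND m.content_hash = ?
          AND m.is_deleted = 0`
    )
    .get(documentId, contentHash);
  return row !== undefined; // a match means the chunk is silently skipped
}
```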
Worker Model
startDocumentWorker() runs a polling loop inside the Daemon process. It
shares the same lease-based job table (memory_jobs) used by the
extraction worker. The job type is document_ingest.
On each tick the worker:
- Atomically claims one pending job by setting its status to leased and incrementing attempts (sketched below).
- Looks up the associated document row.
- Runs it through the full extracting → chunking → embedding → indexing sequence, updating the document status at each stage.
- Marks the job completed and the document done.
If processing throws, the job is retried up to workerMaxRetries times
(default 3). After exhausting retries the job moves to dead and the
document to failed.
Embedding calls are intentionally made outside write transactions to avoid holding a SQLite write lock during network I/O. Each chunk’s memory write is its own short transaction.
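The claim step in the list above can be pictured as a single atomic statement. A sketch, assuming a pending status value and standard lease columns on memory_jobs; the real loop lives in the pipeline package and handles more bookkeeping:

```typescript
import Database from "better-sqlite3";

// Sketch of the atomic claim: lease one pending document_ingest job and bump
// its attempt counter in a single statement. Column names beyond those
// mentioned in this section (type, status, attempts) are assumptions.
function claimNextDocumentJob(db: Database.Database) {
  return db
    .prepare(
      `UPDATE memory_jobs
          SET status = 'leased', attempts = attempts + 1
        WHERE id = (
          SELECT id FROM memory_jobs
           WHERE type = 'document_ingest' AND status = 'pending'
           LIMIT 1
        )
        RETURNING *`
    )
    .get(); // undefined when there is no pending job this tick
}
```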
The worker is started alongside the extraction pipeline in
packages/daemon/src/pipeline/index.ts and stops cleanly when the daemon
shuts down.
Configuration
Document worker settings live under the pipeline key in agent.yaml
(or the equivalent daemon config). All values are validated and clamped
on load.
pipeline:
documentChunkSize: 2000 # chars per chunk
documentChunkOverlap: 200 # overlap between adjacent chunks
documentWorkerIntervalMs: 10000 # polling interval in milliseconds
documentMaxContentBytes: 10485760 # max fetched content for URL sources
documentWorkerIntervalMs controls how frequently the worker polls for
new jobs. The minimum is 1000 ms; the maximum is 300000 ms (5 min).
Changes to chunk size do not retroactively re-chunk existing documents. To re-process a document with new chunk settings, delete it and resubmit.
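When you do change chunk settings and need a document re-processed, the delete-and-resubmit cycle can be scripted. A sketch for a URL-sourced document, reusing the base-URL assumption from the earlier examples; the reason value is illustrative:

```typescript
// Sketch: re-ingest a URL-sourced document after changing chunk settings.
// BASE_URL and the deletion reason are illustrative values.
const BASE_URL = "http://localhost:3000";

async function reingestUrl(documentId: string, url: string): Promise<string> {
  // Soft-delete the old document and its linked chunk memories.
  await fetch(`${BASE_URL}/api/documents/${documentId}?reason=rechunk`, { method: "DELETE" });

  // Resubmit; the old record is now in the deleted state, so URL dedup does not apply.
  const res = await fetch(`${BASE_URL}/api/documents`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ source_type: "url", url }),
  });
  const body = (await res.json()) as { id: string; status: string };
  return body.id;
}
```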