Agent Loop Comparison: Forge, Codex, Hermes
Comparative analysis of agentic loop architectures across three harnesses, identifying best-of-breed patterns to inform Forge’s evolution.
Harnesses analyzed:
- Forge (Rust, ~1,400 LOC across 6 modules) — Signet’s native terminal client
- Codex (Rust, multi-crate workspace) — OpenAI’s CLI agent
- Hermes (Python, ~7,500+ LOC in run_agent.py alone) — RL-oriented research agent
Date: 2026-03-27
1. Core Loop Structure
All three share the same fundamental skeleton:
user input -> build prompt -> call LLM -> stream response
-> tool calls? -> execute tools -> append results -> loop back
-> no tool calls? -> turn complete
The differences lie in how each harness handles the transitions, failures, and resource constraints within that skeleton.
Forge
Entry point: agent_loop.rs:process_message() (~575 lines).
A single loop {} that calls the provider, streams the response, collects
tool calls, executes them sequentially, appends results, and loops. Exits
when the model responds without tool calls, or on error/loop detection.
Memory recall and provider preconnect run in parallel via tokio::join!
on the first pass, then memory context is cleared for subsequent iterations.
Clean, linear, easy to reason about.
Codex
Entry point: codex.rs:run_turn() with an outer submission_loop.
The loop is embedded in a larger task/submission architecture. RegularTask
wraps turns in a task context, and a submission loop dispatches user input,
exec approvals, and user answers as separate Op variants. Pre-sampling
compaction runs before the LLM call; post-sampling compaction runs after
heavy tool use. Pending user input queued during model execution is consumed
on the next iteration.
More ceremony, but handles concurrent user interaction gracefully.
Hermes
Entry point: run_agent.py:run_conversation() (~7,500+ lines).
The most feature-dense loop. Bounded by both max_iterations (default 90)
and a shared iteration_budget that subagents draw from. Includes preflight
compression, Honcho context injection, prompt caching for Claude models via
OpenRouter, and turn-based skill/memory nudge triggers. The loop body handles
tool validation, deduplication, parallel dispatch decisions, and categorized
error recovery, all inline.
Powerful but dense. The single-file approach makes it hard to isolate concerns.
2. Context Window Management
This is where the harnesses diverge most significantly.
Forge (Reactive, Simple)
- Trigger: estimated tokens > 90% of context window
- Estimation:
text_length / 4heuristic (no tokenizer) - Strategy: keep last 2 messages, summarize everything older via LLM
- Failure mode: log warning, continue without compaction
- Timing: checked at end of turn (when no tool calls remain)
Pros: simple, predictable, low overhead. Cons: reactive only, so a single heavy tool response could blow the window before compaction runs. no proactive check before the LLM call.
Codex (Dual-Pass, Proactive)
- Pre-sampling compact: runs BEFORE the LLM call if approaching limits
- Post-sampling compact: runs AFTER if token usage exceeds
auto_compact_limitand follow-up is needed - Model switch detection: reinjects context when model/window size changes mid-session
- History reinjection: rebuilds prompt when
previous_turn_settingsdiffer
Pros: proactive compaction prevents wasted LLM calls on oversized prompts. model switch handling is a real-world concern most harnesses ignore. Cons: more complex state tracking.
Hermes (Adaptive, Resilient)
- Trigger: ~80% of discovered context threshold
- Discovery: parses error messages to detect actual limits, probes smaller tiers, caches results to
~/.hermes/context_lengths.json - Strategy: iterative compression that updates previous summaries to avoid info loss, protects head and tail messages
- Retries: up to 3 compression attempts per conversation
- Auxiliary LLM: uses a cheap/fast model for summarization (doesn’t burn primary model tokens)
Pros: adapts to unknown models without hardcoded limits. iterative summary updates preserve more context than single-pass summarization. separate summarization model is cost-efficient. Cons: error-message parsing for limit detection is fragile.
Best Approach
Combine Codex’s proactive pre-sampling check with Hermes’ iterative compression and auxiliary model for summarization. The pre-sampling check prevents wasted calls; iterative compression preserves more signal; and a cheap summarization model keeps costs down.
Forge’s current reactive-only approach is the most significant gap.
3. Tool Execution
Forge (Sequential, Interactive)
Tools execute one at a time in a for loop. Each tool gets its own
permission check and TUI progress update. Results are batched into a
single message after all tools complete.
Pros: simple, debuggable, great UX for interactive use. per-tool permission prompts are natural. Cons: serial execution of independent read-only tools is unnecessary latency.
Codex (Parallel, Coordinated)
ToolCallRuntime coordinates concurrent execution of multiple tool calls
from a single LLM response. Results collected and streamed back as a batch.
Sandbox denial triggers retry without sandbox if policy permits.
Pros: significant throughput improvement when the model requests multiple independent operations (common with read/grep/glob batches). Cons: parallel execution complicates error attribution and permission flow.
Hermes (Conditional Parallel)
_should_parallelize_tool_batch() analyzes the batch to determine safety:
- Read-only tools (read_file, search_files, web_search) always parallelize
- File tools parallelize when targeting independent paths
- Interactive tools (terminal, browser) always sequential
- Falls back to sequential for anything with side effects
Thread pool with configurable workers. Deduplicates identical tool calls
before execution. Caps delegate_task calls to prevent agent-stack explosion.
Pros: best of both worlds. gets parallel throughput where safe, sequential correctness where needed. deduplication catches a real failure mode. Cons: the safety analysis adds complexity and can be wrong for edge cases.
Best Approach
Hermes’ conditional parallelism is the right model. Forge should parallelize read-only tool batches (read, grep, glob, websearch, webfetch) while keeping write/bash/edit sequential. This captures most of the throughput benefit with minimal risk.
Tool call deduplication before execution is also worth adopting. models sometimes emit identical calls in the same response.
4. Error Handling and Retry Strategy
Forge (Fail Fast)
- Provider errors terminate the turn immediately
- Loop detection (3 identical tool calls) is the primary safety valve
- Tool JSON parse failures get empty-object fallbacks
- No retry on LLM failures
- No fallback models
Pros: predictable, no hidden retry loops burning tokens. Cons: a single transient 500 from the provider kills the entire turn. the user has to manually retry.
Codex (Targeted Recovery)
- Sandbox denial: retry without sandbox if policy allows
- Invalid image: sanitize and retry
- Approval caching: similar commands reuse cached decisions
- Turn abort: clean shutdown on user interrupt
- Network proxy failures trigger auth refresh with 10s timeout
Pros: handles the most common real-world failure modes. approval caching reduces friction without reducing safety. Cons: doesn’t address transient provider failures.
Hermes (Categorized, Exhaustive)
Errors are bucketed into distinct categories with tailored recovery:
| Category | Strategy | Backoff |
|---|---|---|
| Rate limit / empty response | Extended backoff, eager model fallback | 5s, 10s, 20s, 40s, 80s, 120s |
| Context length exceeded | Trigger compression, retry up to 3x | Immediate |
| Client errors (4xx) | Try fallback model, then give up with diagnostics | None |
| Transient errors (5xx, timeout) | Exponential backoff with interrupt check | 2s, 4s, 8s, 16s, 32s, 60s |
| Invalid tool calls | Return error to model for self-correction (3x) | Immediate |
| Length truncation | Request continuation (3x), rollback if unrecoverable | Immediate |
| Incomplete scratchpad | Retry up to 2x | Immediate |
Fallback model system: on repeated failures, switches to a secondary model, resets retry counter.
Pros: remarkably resilient. handles every failure mode the real world throws at a multi-provider agent. the fallback model system is genuinely useful for production reliability. Cons: the complexity is substantial. 7,500 lines in a single file is partly because of this exhaustive error handling.
Best Approach
Forge should adopt a tiered retry strategy without going full Hermes:
- Transient errors (5xx, timeouts): exponential backoff, 3 retries max. this alone would fix the most common pain point.
- Rate limits (429): longer backoff with a budget cap.
- Invalid tool calls: return the error to the model for self-correction (up to 3 attempts). models are good at fixing their own tool call JSON when told what went wrong.
- Context overflow: trigger compaction and retry once.
Fallback models are worth considering but add provider-management complexity that may not justify itself yet.
5. Permission Systems
Forge (Three-Tier Interactive)
Three levels: ReadOnly (auto-approve), Write (check approval list or prompt),
Dangerous (always prompt). Approvals are session-scoped via Arc<Mutex<>>.
The TUI can modify approval state mid-turn.
Codex (Policy-Based Sandboxing)
Four sandbox levels: ReadOnly, WorkspaceWrite, DangerFullAccess,
ExternalSandbox. Approval decisions cached by command hash, so similar
commands reuse prior decisions. Network requests audited separately via
NetworkApprovalService.
Hermes (Task-Scoped Isolation)
Each tool call gets a unique task_id for terminal/browser isolation.
Parallel tools run in isolated threads. Resource cleanup on conversation end.
Less about explicit permission prompts, more about execution isolation.
Best Approach
Forge’s interactive model is the right fit for a terminal client. The
improvement worth borrowing from Codex is approval caching by command
hash. If the user approves bash: bun test, subsequent bash: bun test
calls in the same session shouldn’t re-prompt. Forge already has
session-scoped “always allow” per tool name, but command-level granularity
would reduce friction for repetitive workflows.
6. Streaming and Backpressure
Forge
Simple channel send from agent loop to TUI. No explicit backpressure handling. Tool detail extraction (file paths, commands, patterns) provides rich TUI context. Path shortening shows last 2 segments.
Codex
Bounded channels (8192 capacity). Critical events (TurnCompleted) use
blocking sends for guaranteed delivery. Non-critical notifications dropped
with warning on queue saturation. Dual tokio::select! architecture
multiplexes client requests and server events.
Hermes
90s stale-stream detection, 60s read timeout. Reconstructs tool calls from streamed JSON fragments. Always prefers streaming even without consumers for health checking purposes.
Best Approach
Forge should adopt bounded channels with priority delivery from Codex.
TurnComplete and error events should never be dropped. Hermes’ stale
stream detection is also worth considering: if no tokens arrive for 90s,
something is wrong, and the user should know rather than staring at a
frozen screen.
7. Unique Strengths Worth Noting
Forge
- Runtime-configurable effort and bypass via shared
Arc<Mutex<>>. the TUI can change reasoning effort mid-turn without interrupting the loop. neither reference harness does this. - Cached tool definitions computed once at init. small optimization but it matters when the tool registry is large.
- Signet-native integration for memory, identity, secrets, skills, and MCP. the daemon communication layer is clean and purpose-built.
Codex
- Model switch detection with context reinjection. if the user changes models mid-session, Codex detects the window size change and rebuilds the prompt accordingly. this is a real-world edge case that causes silent failures in other harnesses.
- Pending input queue for messages submitted during model execution. the submission loop consumes them on the next iteration rather than dropping or blocking.
- Hook system with pre-turn, user-prompt, stop, and after-agent hooks. stop hooks can intercept model completion and inject continuation prompts.
Hermes
- Prompt caching for Anthropic models via OpenRouter. auto-injects cache_control breakpoints on system prompt and last 3 messages. reports hit rate. reduces input tokens by ~75% on multi-turn conversations.
- Fallback model system that switches providers on repeated failures. genuine production resilience.
- Dynamic context limit discovery that adapts to unknown models. especially useful for open-source models with undocumented context windows.
- Tool call deduplication and delegate_task capping to prevent runaway subagent spawning.
8. Recommendations for Forge
Ordered by impact relative to implementation cost.
High Impact, Moderate Effort
-
Proactive context compaction. Check token estimate before the LLM call, not just after. Prevents wasted calls on oversized prompts. Codex does this and it’s the single most effective context management improvement.
-
Tiered retry with exponential backoff. Three retries for transient 5xx/timeout errors, with 2s/4s/8s backoff. Covers the most common failure mode (provider hiccups) without adding significant complexity.
-
Conditional parallel tool execution. Parallelize read-only tool batches (read, grep, glob, websearch, webfetch). Keep write/bash/edit sequential. Biggest throughput win for common workflows where the model requests multiple file reads.
Medium Impact, Low Effort
-
Stale stream detection. If no tokens arrive for N seconds (60-90s), surface a warning in the TUI. Users shouldn’t have to guess whether the model is thinking or the connection died.
-
Tool call deduplication. Before executing a batch, deduplicate by (name, input) hash. Models sometimes emit identical calls. Cheap to implement, prevents wasted work.
-
Invalid tool call self-correction. Instead of falling back to an empty object on JSON parse failure, return the error to the model with the schema it should have used. Models are good at fixing this. Cap at 3 retries.
Lower Priority, Higher Effort
-
Approval caching by command hash. Cache permission decisions for specific commands within a session, not just by tool name. Reduces friction for repetitive workflows.
-
Bounded event channels with priority delivery. Guarantee delivery of critical events (TurnComplete, errors) even under backpressure.
-
Auxiliary model for summarization. Use a cheap/fast model for context compaction instead of the primary model. Cost optimization for long sessions.
-
Model switch detection. Detect when the user changes models mid-session and adjust context window expectations accordingly.
Appendix: Architecture Comparison
| Dimension | Forge | Codex | Hermes |
|---|---|---|---|
| Language | Rust | Rust | Python |
| Loop LOC | ~575 | ~1,500+ | ~7,500+ |
| Tool execution | Sequential | Parallel | Conditional parallel |
| Context compaction | Reactive (90%) | Dual-pass (pre+post) | Adaptive (80%, iterative) |
| Token estimation | text/4 heuristic | Token counter | Token counter + error parsing |
| Retry strategy | None (fail fast) | Targeted (sandbox, images) | Categorized (6+ categories) |
| Permission model | 3-tier interactive | Policy + sandbox + cache | Task-scoped isolation |
| Streaming backpressure | None | Bounded channels | Timeout detection |
| Parallel tool exec | No | Yes | Conditional |
| Fallback models | No | No | Yes |
| Loop detection | Hash-based (3x) | No | No |
| Context limit discovery | Hardcoded | Hardcoded | Dynamic (error parsing) |
| Prompt caching | No | No | Yes (Anthropic via OpenRouter) |
| Model switch handling | No | Yes | No |