Agent Harnesses: Technical Spark Notes
A deep architectural breakdown for builders, not observers.
Part 1: What a Harness Actually Is
The One-Sentence Definition
A harness is the persistent runtime environment that wraps an LLM to transform it from a stateless text-completion engine into a durable, tool-using, memory-having autonomous agent.
The Analogy That Actually Holds Up
Model = CPU. Context window = RAM. Harness = Operating System.
A CPU without an OS can execute instructions but can't manage processes, allocate memory, handle I/O, enforce permissions, or recover from crashes. Similarly, a frontier model without a harness can generate brilliant text but can't persist state between sessions, safely execute code, remember what it did yesterday, recover from errors mid-task, or coordinate with other agents.
The harness is what makes raw intelligence operational.
What a Harness Is NOT
| Term | What It Is | How It Differs From a Harness |
|---|---|---|
| Wrapper | Thin API translation layer (e.g., OpenAI SDK → Anthropic SDK) | A component inside a harness, not a harness itself |
| Framework | Building blocks + abstractions (LangChain, CrewAI, Mastra) | Gives you Legos. A harness gives you a built house you can remodel |
| Runtime | Durable execution infrastructure (LangGraph, Temporal) | Handles crash recovery and persistence, but not planning, context, or tools |
| Orchestrator | Control flow manager ("do step A, then B, then C") | The "brain" deciding what happens next. The harness is the "body" ensuring it happens safely |
| Scaffold | One-time setup structure | Temporary. A harness is the persistent operating environment |
| Agent OS | Near-synonym | "Agent OS" is a metaphor. "Harness" is a concrete software system |
The Taxonomy (Harrison Chase / LangChain, Oct 2025)
```
┌──────────────────────────────────────────────┐
│                   HARNESS                    │
│  Opinionated, ready-to-run agent with        │
│  defaults for prompts, planning, tools,      │
│  memory, context management, sub-agents      │
│                                              │
│  ┌────────────────────────────────────────┐  │
│  │                RUNTIME                 │  │
│  │  Durable execution, persistence,       │  │
│  │  crash recovery, streaming             │  │
│  │                                        │  │
│  │  ┌──────────────────────────────────┐  │  │
│  │  │            FRAMEWORK             │  │  │
│  │  │  Abstractions, building blocks,  │  │  │
│  │  │  model interfaces, tool schemas  │  │  │
│  │  └──────────────────────────────────┘  │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘
```

The critical distinction: frameworks require assembly; harnesses work out of the box. A framework says "here are the parts, build what you want." A harness says "here's a working agent — customize it."
Part 2: The Seven Architectural Pillars
Every production harness implements these seven subsystems. The quality of each determines whether the agent reliably completes work or produces expensive hallucinations.
Pillar 1: Sandbox & Isolation
What it solves: Agents execute code, modify files, and make network requests. Without isolation, a single hallucinated rm -rf / destroys the host system.
The isolation hierarchy (weakest → strongest):
| Level | Technology | Cold Start | Use Case |
|---|---|---|---|
| Process-level | V8 isolates, Deno | ~1ms | Lightweight scripting |
| Container | Docker, Podman | ~500ms | Standard agent sandboxing |
| Container + kernel hardening | gVisor, seccomp | ~800ms | Defense-in-depth |
| MicroVM | Firecracker, Kata | ~150ms | Maximum isolation (E2B, OpenAI Codex cloud) |
| Full VM | QEMU/KVM | ~2-5s | Legacy, rarely used for agents |
Key architectural decisions:
- Network access: OpenAI Codex disables network by default in cloud tasks. This is the security-first posture. Agents that need network get explicit allowlists.
- Filesystem scope: Each agent task gets its own filesystem overlay (copy-on-write). Changes are ephemeral unless explicitly committed. OpenAI achieves this with per-worktree git checkouts — each Codex run gets an isolated copy of the repo.
- Resource limits: CPU time, memory, disk I/O, and execution duration are capped. Without these, a confused agent can burn infinite compute.
- Teardown: The sandbox and everything in it (logs, metrics, temp files) are destroyed when the task completes. This prevents cross-contamination between runs.
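As a minimal sketch of the resource-limits idea, the snippet below caps CPU time (and optionally address space) for a child process using POSIX rlimits. This is per-process only and purely illustrative — production harnesses enforce limits at the container or microVM layer, where disk I/O and wall-clock duration can also be capped.

```python
import resource
import subprocess
import sys

def run_limited(cmd, cpu_seconds=10, mem_bytes=None):
    """Run a command under hard resource caps (POSIX only).
    A per-process sketch of the 'resource limits' decision above."""
    def set_limits():
        # Hard CPU cap: the kernel kills the process past this budget.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        if mem_bytes is not None:
            # Optional address-space cap.
            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return subprocess.run(
        cmd, preexec_fn=set_limits, capture_output=True,
        text=True, timeout=cpu_seconds * 2,  # wall-clock backstop
    )
```

Usage: `run_limited([sys.executable, "-c", "print('ok')"])` runs a Python one-liner under the caps.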
The emerging pattern: Universal sandbox adapters (e.g., Rivet's sandbox-agent) that provide a single HTTP API in front of any sandbox provider. The agent doesn't know or care whether it's running in Firecracker, Docker, or bare metal — the harness abstracts this entirely.
Pillar 2: Context Engineering
What it solves: The context window is the agent's entire reality. Everything it can't see in-context effectively doesn't exist. Context engineering is the discipline of constructing what the agent sees and when.
Why this is the hardest problem:
Even models with 1M+ token windows degrade well before that limit. Manus discovered a practical "Pre-Rot Threshold" of ~256K tokens — beyond that, performance collapses regardless of advertised capacity. The effective context window is far smaller than the theoretical one.
The four core techniques:
1. Progressive Disclosure (a.k.a. "Give them a map, not a manual")
Don't dump everything into context at once. Structure knowledge in layers:
```
Layer 0: AGENTS.md (~100 lines)
  → "Here's who you are, here's the table of contents, here's where to look"

Layer 1: ARCHITECTURE.md, FRONTEND.md, SECURITY.md
  → Domain-specific guidance, loaded on-demand when the task touches that area

Layer 2: docs/design-docs/, docs/product-specs/
  → Detailed specs, loaded only when actively working on that feature

Layer 3: The code itself
  → Read individual files as needed, never wholesale
```

OpenAI learned this the hard way. They tried a monolithic AGENTS.md and it failed because: (a) it crowded out the actual task, (b) when everything is "important," nothing is, and (c) it rotted instantly. The solution: AGENTS.md as a table of contents, not an encyclopedia.
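The layering can be sketched as a tiny doc selector. The tag-to-path index below is hypothetical — the point is only that Layer 0 always loads and deeper layers load on demand:

```python
# Hypothetical Layer-1 index: task tag -> doc path (names are illustrative).
DOC_INDEX = {
    "frontend": "docs/FRONTEND.md",
    "security": "docs/SECURITY.md",
    "architecture": "ARCHITECTURE.md",
}

def docs_for_task(task_tags):
    """Progressive disclosure: always load the Layer-0 table of contents,
    then pull in only the Layer-1 docs whose tags match the task."""
    selected = ["AGENTS.md"]
    selected += [path for tag, path in DOC_INDEX.items() if tag in task_tags]
    return selected
```

A frontend task would load two files instead of the whole doc tree; Layers 2-3 would follow the same on-demand rule.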
2. Context Compaction
Long-running agents accumulate conversation history that eventually fills the window. Compaction strategies:
- Periodic summarization: Every N turns, summarize the conversation so far and replace the full history with the summary. Anthropic does this automatically.
- Sliding window with anchors: Keep the system prompt + last N messages + any "pinned" messages (key decisions, error states).
- Tool call pruning: Tool inputs/outputs are verbose. After processing, replace them with a summary of what was learned.
3. Initializer Agent Pattern (Anthropic)
Use a different prompt for the first context window versus subsequent ones. The initializer agent acts as a boot sequence — it reads the repository, understands the task, creates a plan file, then hands off to the "worker" agent that actually executes. This is equivalent to BIOS → OS boot → application launch.
4. Attention-Aware Placement
LLMs attend differently to different positions in context. Manus exploits this with their todo.md pattern — the task list is constantly rewritten to stay at the end of the context window, where recency bias keeps it most salient. Instructions at the beginning benefit from primacy effects. The dead zone is the middle.
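A minimal sketch of attention-aware assembly, assuming plain string context parts (the `## todo.md` header is illustrative):

```python
def assemble_context(instructions, history, todo_md):
    """Attention-aware placement: stable instructions first (primacy),
    compacted history in the middle (the dead zone), and the live task
    list last (recency), echoing Manus's todo.md pattern."""
    return "\n\n".join([instructions, *history, "## todo.md\n" + todo_md])
```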
The anti-patterns:
- Monolithic instruction files: Rot instantly, crowd out the task, provide non-guidance.
- Context pollution: Loading irrelevant information "just in case" — every token has an opportunity cost.
- Few-shot self-repetition: Agents that see their own prior outputs start copying patterns rather than reasoning fresh.
- Context debt: Ambiguous, contradictory, or outdated agent instructions that accumulate over time. This is the equivalent of technical debt but for agent configuration.
Pillar 3: Memory Management
What it solves: LLMs are stateless. Every API call starts from zero. Memory gives agents continuity across sessions.
The three memory tiers:
| Tier | Scope | Implementation | Example |
|---|---|---|---|
| Working memory | Current task | The active context window | "I'm currently debugging the login form" |
| Session memory | Current session, survives context compaction | File-based checkpoints (claude-progress.txt, todo.md) | "I've completed steps 1-3, step 4 is in progress" |
| Long-term memory | Cross-session, persistent | Vector databases, LangGraph Store, filesystem | "This user prefers TypeScript, their API key is stored at ~/.config" |
File-based memory dominates in practice:
- Anthropic: `claude-progress.txt` — a structured log enabling session handoff. When a context window fills, the next session reads this file to reconstruct state.
- OpenAI Codex: `AGENTS.md` + `docs/exec-plans/active/` — execution plans with progress and decision logs, versioned in git.
- Manus: `todo.md` — a constantly rewritten task list that serves as both memory and attention anchor.
- OpenClaw: all state stored as Markdown/YAML files on local disk. Memory is literally files.
Why file-based > vector DB for most agent memory:
Vector databases excel at semantic search across large corpora. But most agent memory is structured, recent, and task-specific — closer to a scratchpad than a knowledge base. Files are inspectable, diffable, version-controllable, and trivially readable by the agent. The overhead of embedding, indexing, and querying a vector store is often worse than just reading a well-organized file.
Where vector search does matter: Long-term memory across many sessions (e.g., "what did the user tell me about their preferences 3 weeks ago?"), and retrieval over large codebases (finding the right file in a 10,000-file repo).
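A sketch of file-based session memory, assuming a JSON checkpoint for convenience (the named systems above use free-form text files like `claude-progress.txt` and `todo.md`; the schema here is invented):

```python
import json
from pathlib import Path

def save_checkpoint(path, state):
    """Persist session memory as a plain file the next (fresh) context
    window can read back -- inspectable, diffable, version-controllable."""
    Path(path).write_text(json.dumps(state, indent=2))

def load_checkpoint(path):
    """Reconstruct state, or start fresh if no checkpoint exists yet."""
    p = Path(path)
    if not p.exists():
        return {"steps_done": [], "in_progress": None}
    return json.loads(p.read_text())
```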
Pillar 4: Model Routing
What it solves: No single model is optimal for all tasks. Routing sends easy tasks to cheap models and hard tasks to expensive ones.
The routing hierarchy:
```
User intent → Task classification
                   │
         ┌─────────┼──────────┐
         ▼         ▼          ▼
      Simple     Medium    Complex
     (Haiku)    (Sonnet)    (Opus)
      ~$0.25     ~$3.00    ~$15.00
      per 1M     per 1M     per 1M
```

What a production router handles:
- Task-to-model mapping: "Reformatting this JSON → Haiku. Designing the API architecture → Opus."
- Fallback chains: Primary model times out or rate-limits → try secondary → try tertiary.
- Provider abstraction: Swap between Anthropic, OpenAI, Google without changing agent code. The harness handles auth, format translation, and error normalization.
- Cost accounting: Track spend per task, per agent, per session. Alert when budget thresholds are hit.
The pain point everyone reports: Model routing is awful in practice. Models use different naming conventions, different API formats, different error codes. Auth juggling across providers is a constant source of bugs. This is one of the top four setup pain points cited by practitioners.
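A fallback-chain router can be sketched as follows. The tier names and model names are made up; `call_model` stands in for whatever provider-abstraction function the harness supplies:

```python
# Illustrative tier -> fallback chain mapping; model names are invented.
ROUTES = {
    "simple":  ["cheap-fast-model", "mid-model"],
    "complex": ["frontier-model", "mid-model"],
}

def route(task_class, call_model):
    """Send the task down its tier's fallback chain: if the primary
    model times out or rate-limits, try the next one in line."""
    last_err = None
    for model in ROUTES[task_class]:
        try:
            return model, call_model(model)
        except Exception as err:  # timeout, 429, provider outage...
            last_err = err
    raise RuntimeError(f"all models in the {task_class} chain failed") from last_err
```

Cost accounting and provider-specific error normalization would hang off the same loop, which is exactly where the naming/auth pain described above shows up.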
Pillar 5: Tool Orchestration
What it solves: Agents need to do things — execute code, browse the web, read files, call APIs, query databases. The harness manages tool registration, invocation, permission, and error handling.
MCP (Model Context Protocol) is the emerging standard:
Introduced by Anthropic (Nov 2024), donated to Linux Foundation (Dec 2025), adopted by OpenAI, Google DeepMind, Microsoft. MCP provides:
- A standard JSON-RPC protocol for tool registration and invocation
- Client-server architecture where tool servers expose capabilities
- Schema-based tool descriptions that models can reason about
- Transport-agnostic (stdio, HTTP, WebSocket)
The tool taxonomy:
| Category | Examples | Risk Level |
|---|---|---|
| Read-only | File read, web search, database query | Low |
| Reversible writes | File edit with git, branch creation | Medium |
| Irreversible writes | Email send, database delete, API call with side effects | High (requires human approval) |
| System-level | Process execution, network configuration, credential access | Critical (sandboxed or prohibited) |
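The taxonomy above maps directly onto a risk-gated dispatcher. A minimal sketch, with invented tool names and risk labels mirroring the table:

```python
# Illustrative tool -> risk mapping, mirroring the taxonomy above.
RISK = {"read_file": "low", "edit_on_branch": "medium", "send_email": "high"}

def dispatch(tool, args, registry, approve):
    """Gate tool calls by risk tier: irreversible (high-risk) actions
    require an explicit human-approval callback; anything unregistered
    is treated as critical and refused outright."""
    level = RISK.get(tool, "critical")
    if level == "critical":
        raise PermissionError(f"{tool}: system-level tools are sandboxed or prohibited")
    if level == "high" and not approve(tool, args):
        return {"status": "denied", "tool": tool}
    return {"status": "ok", "result": registry[tool](**args)}
```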
Browser automation is the least solved problem:
Harnesses use Chrome DevTools Protocol (CDP), Playwright, or Puppeteer to give agents browser access. OpenAI wired CDP directly into the Codex runtime for DOM snapshots, screenshots, and navigation. But significant limitations persist — agents can't see browser-native alert modals through Puppeteer, vision-based approaches miss entire categories of visual bugs, and complex multi-step UI flows remain fragile.
Pillar 6: Scheduling & Triggers
What it solves: Agents need to run on schedules (cron), in response to events (webhooks), and in long-running loops.
The dominant pattern for long-running work:
```
┌─────────────────────────────────────────┐
│  OUTER LOOP (harness-managed)           │
│                                         │
│  1. Load/create checkpoint file         │
│  2. Inject into fresh context window    │
│  3. Run agent (5-60 min burst)          │
│  4. Agent writes progress to checkpoint │
│  5. Context window fills or task pauses │
│  6. Harness reads checkpoint            │
│  7. → Go to step 1                      │
│                                         │
│  Repeat until: task complete, budget    │
│  exhausted, or human escalation         │
└─────────────────────────────────────────┘
```

Anthropic's research found that short, focused bursts (~5 minutes) with structured handoff outperform marathon sessions. The Codex team regularly sees single runs work for 6+ hours, but these are structured as many bursts with checkpoint persistence, not one continuous context.
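The outer loop reduces to a few lines once the burst, load, and save callbacks are injected. A sketch, with an assumed `done` flag in the checkpoint state:

```python
def outer_loop(run_burst, load, save, max_bursts=10):
    """Harness-managed outer loop: seed a fresh context from the
    checkpoint, run one bounded agent burst, persist progress, and
    repeat until the task is done or the burst budget is exhausted."""
    for _ in range(max_bursts):
        state = load()                # step 1: load/create checkpoint
        if state.get("done"):
            return state              # task complete
        state = run_burst(state)      # steps 2-3: fresh context, one burst
        save(state)                   # step 4: write progress
    return load()                     # budget exhausted -> escalate to a human
```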
Cron reliability is a major pain point. When a cron job fires while the agent is mid-task, it can corrupt state or override the current context. Solutions:
- Task queuing: Cron jobs add to a queue; the agent processes them when idle.
- Priority interrupts: Critical events preempt current work, with state saved first.
- Heartbeat isolation: Background health checks run in separate agent instances, not in the primary task context.
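The task-queuing solution can be sketched as a queue the cron trigger writes into and the harness drains only between bursts. The class and method names are illustrative:

```python
from collections import deque

class TaskQueue:
    """Cron fires enqueue() instead of interrupting the agent; the
    harness drains the queue only when the agent is idle, so a mid-task
    trigger can never corrupt the active context."""
    def __init__(self):
        self._q = deque()

    def enqueue(self, task, priority=False):
        # Priority interrupts jump the line; ordinary cron work waits.
        (self._q.appendleft if priority else self._q.append)(task)

    def next_when_idle(self, agent_busy):
        return None if (agent_busy or not self._q) else self._q.popleft()
```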
Pillar 7: Observability
What it solves: When an agent runs for 6 hours autonomously, you need to understand what happened, why it made specific decisions, where it went wrong, and how much it cost.
What a harness must track:
- Planning traces: What plan did the agent create? How did it decompose the task?
- Tool call logs: Every tool invocation with inputs, outputs, latency, and success/failure.
- Context window snapshots: What was in context at each decision point?
- Token consumption: Per-turn and cumulative, broken down by model.
- Cost accounting: Real-time spend tracking with budget alerts.
- State transitions: What checkpoints were written? What decisions were logged?
- Handoff points: When did context compaction occur? What was summarized vs. dropped?
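The tool-call-log requirement can be sketched as a wrapper that records inputs, latency, and success/failure for every invocation. The JSON-lines shape is an assumption, not a standard:

```python
import json
import time

def traced(tool_name, fn, log):
    """Wrap a tool so every invocation is logged with its inputs,
    latency, and outcome -- the minimum trace a harness should keep."""
    def wrapper(**kwargs):
        start = time.monotonic()
        try:
            result = fn(**kwargs)
            ok = True
            return result
        except Exception:
            ok = False
            raise
        finally:
            log.append(json.dumps({
                "tool": tool_name,
                "args": kwargs,
                "ms": round((time.monotonic() - start) * 1000, 2),
                "ok": ok,
            }))
    return wrapper
```

Failures are logged before the exception propagates, so the trace survives even when the tool call does not.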
The observability stack (OpenAI's approach):
Each agent worktree gets its own ephemeral observability stack:
- Logs → Victoria Logs (queryable via LogQL)
- Metrics → Victoria Metrics (queryable via PromQL)
- Traces → Victoria Traces (queryable via TraceQL)
- All data → fanned out via Vector
When the task completes, the stack is torn down. This gives each agent run full observability without cross-contamination.
Part 3: What Makes Harnesses Fundamentally Different
The Inversion of Engineering
In traditional software engineering:
- Humans write code → machines execute it
- The engineer's job is implementation
- Tools serve the human
In harness engineering:
- Humans specify intent → agents write and execute code
- The engineer's job is environment design
- The human serves the agent (by making things legible and enforceable)
OpenAI's framing: "The primary job of our engineering team became enabling the agents to do useful work." When something failed, the fix was never "try harder" — it was always "what capability is missing, and how do we make it legible and enforceable for the agent?"
The Legibility Principle
From the agent's point of view, anything it can't access in-context effectively doesn't exist.
This is the single most important principle in harness engineering. It means:
- Slack discussions that aligned the team on an architecture → invisible to agents
- Google Docs containing product specs → invisible to agents
- Tacit knowledge in engineers' heads → invisible to agents
- Knowledge must be pushed into the repo as versioned, structured, agent-readable artifacts
This is why OpenAI treats the repository as the "system of record" — not Notion, not Confluence, not Slack. The repo is the only thing the agent can see.
Mechanical Enforcement Over Documentation
Humans can read a style guide and exercise judgment. Agents pattern-match against what they see in the codebase, including bad patterns. Documentation alone doesn't prevent drift.
The hierarchy of enforcement:
```
Weakest ←──────────────────────────────────→ Strongest

README    AGENTS.md       Linter      CI gate       Type system
advice    instructions    warning     that blocks   that won't
                                      merge         compile
```

OpenAI's rule: "When documentation falls short, we promote the rule into code." Custom linters enforce naming conventions, dependency directions, file size limits, and structured logging. The linter error messages are written specifically to inject remediation instructions into agent context — so when the agent hits a lint failure, the error tells it exactly how to fix it.
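A minimal sketch of such a rule: a file-size check whose error message doubles as remediation instructions for the agent. The 400-line cap and the wording are illustrative:

```python
def lint_file_sizes(files, max_lines=400):
    """Mechanical-enforcement sketch: a custom lint rule whose error
    message is written as remediation instructions, so a failing agent
    is told exactly how to fix the violation."""
    errors = []
    for path, text in files.items():
        n = text.count("\n") + 1
        if n > max_lines:
            errors.append(
                f"{path}: {n} lines exceeds the {max_lines}-line limit. "
                "Fix: split this module by responsibility into smaller "
                "files and re-export the public surface from one entry point."
            )
    return errors
```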
Entropy is Constant
Agents replicate patterns that already exist in the codebase — including bad ones. Over time, this leads to architectural drift, style inconsistency, and accumulated "AI slop."
The garbage collection pattern:
- Define "golden principles": Opinionated, mechanical rules (prefer shared utilities over hand-rolled helpers; validate boundaries, don't probe YOLO-style).
- Automated scanning: Background agents run on a regular cadence, scanning for deviations from golden principles.
- Targeted refactoring PRs: Each deviation generates a small, focused PR that can be reviewed in <1 minute and auto-merged.
- Quality grades: Each product domain and architectural layer gets a quality score, tracked over time.
OpenAI's team went from spending 20% of their week (every Friday) on manual cleanup to fully automated entropy management. The key insight: "Technical debt is like a high-interest loan — it's almost always better to pay it down continuously in small increments than to let it compound."
Part 4: How Harnesses Are Built (Implementation Patterns)
The Agent Loop
Every harness implements some variant of this core loop:
```
┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────┐     ┌──────────┐                 │
│   │ Receive  │────▶│ Construct│                 │
│   │ Task     │     │ Context  │                 │
│   └──────────┘     └────┬─────┘                 │
│                         │                       │
│                         ▼                       │
│                   ┌──────────┐                  │
│                   │ LLM Call │◀──────────┐      │
│                   └────┬─────┘           │      │
│                        │                 │      │
│               ┌────────┴────────┐        │      │
│               ▼                 ▼        │      │
│        ┌──────────┐      ┌──────────┐    │      │
│        │ Text     │      │ Tool     │    │      │
│        │ Response │      │ Call     │    │      │
│        └────┬─────┘      └────┬─────┘    │      │
│             │                 │          │      │
│             │                 ▼          │      │
│             │          ┌──────────┐      │      │
│             │          │ Execute  │      │      │
│             │          │ in       │      │      │
│             │          │ Sandbox  │      │      │
│             │          └────┬─────┘      │      │
│             │               │            │      │
│             │               ▼            │      │
│             │          ┌──────────┐      │      │
│             │          │ Append   │──────┘      │
│             │          │ Result   │             │
│             │          │ to       │             │
│             │          │ Context  │             │
│             ▼          └──────────┘             │
│      ┌──────────┐                               │
│      │ Output / │                               │
│      │ Commit   │                               │
│      └──────────┘                               │
│                                                 │
└─────────────────────────────────────────────────┘
```

The harness owns every box except "LLM Call," which is the model provider's responsibility. Everything else — context construction, tool execution, sandboxing, result processing, output delivery — is harness territory.
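The loop above can be sketched in a few lines. The message shapes (`tool`/`args`/`content` keys) are assumptions for illustration, not any provider's actual API:

```python
def agent_loop(llm, tools, task, max_turns=20):
    """Minimal harness loop sketch: call the model, execute any tool
    call (the sandboxed step in the diagram), append the result to
    context, and repeat until the model returns plain text."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        msg = llm(context)
        context.append(msg)
        if "tool" not in msg:
            return msg["content"]           # text response -> output/commit
        result = tools[msg["tool"]](**msg.get("args", {}))
        context.append({"role": "tool", "content": str(result)})
    raise RuntimeError("turn budget exhausted")
```

The `max_turns` cap is the harness's resource limit showing up again: without it, a confused model can loop forever.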
The AGENTS.md Convention
AGENTS.md is the equivalent of .gitignore or Dockerfile — a convention file that harnesses look for at the root of a repository to understand how to operate.
What research found in the wild (13-16 instruction categories):
- Functional directives dominate (coding standards, tool usage, response format)
- Non-functional requirements (security, performance, accessibility) are systematically underrepresented
- Most files score "Difficult" to "Very difficult" readability
- The best files are short (~100 lines), act as a table of contents, and point to deeper docs
The progressive disclosure structure that works:
```
AGENTS.md (100 lines)
 ├── Points to → ARCHITECTURE.md (system map)
 ├── Points to → docs/FRONTEND.md (UI conventions)
 ├── Points to → docs/SECURITY.md (security rules)
 ├── Points to → docs/PRODUCT_SENSE.md (product principles)
 ├── Points to → docs/exec-plans/active/ (current work)
 └── Points to → docs/design-docs/index.md (design history)
```

The Agent-to-Agent Review Loop
OpenAI's most advanced pattern: no human review required for most PRs.
```
Engineer writes prompt
        │
        ▼
Codex generates code + tests
        │
        ▼
Codex reviews its own changes (local)
        │
        ▼
Codex requests additional agent reviews (cloud)
        │
        ▼
Responds to agent feedback, iterates
        │
        ▼
All agent reviewers satisfied?
    │         │
    No        Yes
    │         │
    ▼         ▼
 Iterate   Auto-merge
```

This is what they call a "Ralph Wiggum Loop" — the agent keeps iterating with itself and other agents until the review passes. Humans review only when judgment is required (product direction, architectural decisions, user-facing trade-offs).
The Layered Architecture Enforcement Pattern
To prevent agents from creating spaghetti dependencies:
```
Within each business domain:

Types → Config → Repo → Service → Runtime → UI

Rules:
- Code can only depend "forward" (Types → Config is OK, Config → Types is not)
- Cross-cutting concerns (auth, telemetry, feature flags) enter through
  a single explicit interface: Providers
- Dependency direction is validated by custom linters
- Linter errors include remediation instructions in the error message
```

This is "the kind of architecture you usually postpone until you have hundreds of engineers." With coding agents, it's an early prerequisite — the constraints are what allow speed without decay.
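The forward-only rule is mechanically checkable. A sketch, with the remediation message invented in the spirit of the linter rule above:

```python
# Layer order within a business domain, per the rules above.
LAYERS = ["Types", "Config", "Repo", "Service", "Runtime", "UI"]

def check_dependency(src, dst):
    """Forward-only dependency check: an edge is legal only if it points
    forward (or stays within its layer). Returns None when legal, else a
    remediation message an agent can act on directly."""
    if LAYERS.index(dst) >= LAYERS.index(src):
        return None
    return (f"Illegal dependency {src} -> {dst}. Fix: invert the edge, or "
            "route the shared concern through the Providers interface.")
```

A real linter would extract `src`/`dst` from import statements; the check itself stays this simple.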
Part 5: The Current Landscape (Who Built What)
The Five Major Harnesses
| Harness | Owner | Strategy | Differentiator |
|---|---|---|---|
| Codex | OpenAI | Model-tied (GPT-5.x), CLI + cloud | Deepest harness engineering philosophy; AGENTS.md convention; GPT-5.3-Codex model |
| Claude Code / Agent SDK | Anthropic | Model-tied (Claude), SDK + sub-agents | Context compaction research; initializer agent pattern; skills system |
| Manus | Meta (acquired $2-3B) | Model-agnostic wrapper | Won benchmarks with zero proprietary models; pure harness quality; rewrote 5x in 6 months |
| DeepAgents | LangChain | Model-agnostic, open-source | Only major general-purpose model-agnostic harness; built on LangGraph |
| OpenClaw | Open-source (community) | Local-first, 12+ platforms | 145K+ GitHub stars; Markdown/YAML state; massive adoption; significant security concerns |
The Key Insight from the Landscape
Manus proved the thesis. They competed on labor benchmarks against model companies without training or fine-tuning any models. They used off-the-shelf Claude and competed purely on harness quality — context engineering, tool orchestration, execution reliability. Meta paid $2-3B for this capability.
This means: the harness is where identical models produce dramatically different outcomes. The model is necessary but not sufficient. The infrastructure wrapping the model determines whether you get reliable work or expensive hallucinations.
Part 6: The Unsolved Problems
The 70% Problem
With almost no effort, you can get 70% of the way to a working agent. The remaining 30% requires an order-of-magnitude more investment in harness engineering. This is where most projects stall.
The Five Hardest Problems (Ranked)
1. Constant Harness Rewrites Each new model release changes the optimal agent architecture. Manus rewrote 5x in 6 months. Vercel removed 80% of their agent's tools and got better results. Over-engineered control flow becomes a liability when the underlying model gets smarter. The "Bitter Lesson" applies: the best harness builders practice continuous simplification.
2. Context Engineering at Scale Context debt accumulates silently. Outdated instructions, contradictory guidance, and verbose tool outputs degrade performance over time. There is no "context linter" equivalent — maintaining context quality requires active gardening.
3. Security OWASP rates prompt injection as the #1 LLM risk with 84%+ failure rate against prompt-only defenses. Agents with broad system access (file system, browser, network) create attack surfaces that traditional security models weren't designed for. The OpenClaw ecosystem demonstrates this tension — massive adoption with inadequate security vetting.
4. Long-Run Reliability Current benchmarks test behavior over short bursts. Nobody benchmarks what happens at the 100th tool call, or after 4 hours of autonomous operation. Models that are smart enough to solve a hard puzzle in two tries may fail to follow initial instructions after running for an hour.
5. Browser Automation Complex multi-step UI flows remain fragile. Vision limitations mean agents miss visual bugs. Alert modals, drag-and-drop, and dynamic content create edge cases that break regularly.
Part 7: What the Future of Harnesses Looks Like
Property 1: Harnesses Will Continuously Simplify
As models get smarter, harnesses get thinner. Every rigid assumption about agent control flow gets invalidated by the next model release. Anthropic reportedly strips out harness complexity as Claude improves. The winning strategy is minimal viable scaffolding — just enough structure to keep the agent safe and on-track, not so much that it fights the model's own capabilities.
Property 2: Context Engineering Becomes a First-Class Discipline
In 2023, "prompt engineering" was the hot skill. In 2026, it's "context engineering." This is about designing what information enters the context window, when, in what order, at what granularity, and how it's refreshed. It has its own patterns (progressive disclosure, initializer agents, pre-rot thresholds, attention-aware placement) and anti-patterns (monolithic files, context pollution, few-shot self-repetition). This is an engineering discipline, not an art.
Property 3: Harnesses Become Part of the Model Training Loop
GPT-5.3-Codex was "instrumental in creating itself." The harness generates trajectory data (what the agent tried, what worked, what failed) that feeds back into model fine-tuning. This creates a flywheel: better harness → better trajectory data → better model → simpler harness needed → repeat. Builders who capture this data will define what agents can do next.
Property 4: Security-by-Default Replaces Permission-Grant
The current model: "the agent can do anything unless we restrict it." The future model: "the agent can do nothing unless we allow it." Network disabled by default. Filesystem restricted to a sandbox. Every irreversible action requires explicit human approval. The harness operates as a firewall, not a gateway.
Property 5: Vertical Specialization Wins Over Horizontal Generality
Generic harnesses will be commoditized by cloud providers (AWS, Azure, GCP) within 12-18 months. The durable value is in domain-specific harnesses that embed workflow knowledge, regulatory constraints, and proprietary data for specific industries — legal, healthcare, finance, enterprise IT. These create switching costs that horizontal harnesses cannot.
Property 6: Multi-Agent Coordination Becomes Native
Current harnesses are mostly single-agent. The next generation will natively support:
- Agent-to-agent review: One agent writes, another reviews (OpenAI already does this)
- Hierarchical delegation: A planning agent decomposes work and assigns sub-agents
- Shared memory: Multiple agents read/write to a common state store
- Conflict resolution: When two agents make contradictory changes to the same codebase
Property 7: Observability Becomes the Moat
The harness that best understands why agents succeed or fail — through detailed traces, cost analysis, and quality metrics — becomes the one that can improve fastest. LangSmith's strategy (own the quality measurement layer regardless of which framework wins) is directionally correct. You can't improve what you can't measure.
Appendix: The Properties Checklist
Use this to evaluate any harness:
| Property | Question | Red Flag |
|---|---|---|
| Isolation | Can a rogue agent damage the host system? | No sandboxing, or sandbox with full network access |
| Context management | How does it handle context overflow? | No compaction strategy; monolithic instruction files |
| Memory | Does state persist across sessions? | Stateless between runs; no checkpoint mechanism |
| Model routing | Can it use different models for different tasks? | Hardcoded to a single model/provider |
| Tool safety | Are irreversible actions gated on human approval? | All tool calls execute without review |
| Scheduling | Can it run on cron without corrupting active work? | Cron fires into active context; no task queuing |
| Observability | Can you reconstruct why the agent made a specific decision? | Black box execution; no trace logging |
| Error recovery | What happens when a tool call fails mid-task? | Agent halts; no retry or checkpoint recovery |
| Entropy management | How is code/artifact quality maintained over time? | No automated scanning; relies on manual review |
| Security | What's the blast radius of prompt injection? | Full system access; no permission boundaries |