Agent Harnesses: Technical Spark Notes
A deep architectural breakdown for builders, not observers.
Part 1: What a Harness Actually Is
The One-Sentence Definition
A harness is the persistent runtime environment that wraps an LLM to transform it from a stateless text-completion engine into a durable, tool-using, memory-having autonomous agent.
The Analogy That Actually Holds Up
Model = CPU. Context window = RAM. Harness = Operating System.
A CPU without an OS can execute instructions but can't manage processes, allocate memory, handle I/O, enforce permissions, or recover from crashes. Similarly, a frontier model without a harness can generate brilliant text but can't persist state between sessions, safely execute code, remember what it did yesterday, recover from errors mid-task, or coordinate with other agents.
The harness is what makes raw intelligence operational.
What a Harness Is NOT
| Term | What It Is | How It Differs From a Harness |
|---|---|---|
| Wrapper | Thin API translation layer (e.g., OpenAI SDK → Anthropic SDK) | A component inside a harness, not a harness itself |
| Framework | Building blocks + abstractions (LangChain, CrewAI, Mastra) | Gives you Legos. A harness gives you a built house you can remodel |
| Runtime | Durable execution infrastructure (LangGraph, Temporal) | Handles crash recovery and persistence, but not planning, context, or tools |
| Orchestrator | Control flow manager ("do step A, then B, then C") | The "brain" deciding what happens next. The harness is the "body" ensuring it happens safely |
| Scaffold | One-time setup structure | Temporary. A harness is the persistent operating environment |
| Agent OS | Near-synonym | "Agent OS" is a metaphor. "Harness" is a concrete software system |
The Taxonomy (Harrison Chase / LangChain, Oct 2025)
```
┌──────────────────────────────────────────────┐
│                   HARNESS                    │
│  Opinionated, ready-to-run agent with        │
│  defaults for prompts, planning, tools,      │
│  memory, context management, sub-agents      │
│                                              │
│  ┌────────────────────────────────────────┐  │
│  │                RUNTIME                 │  │
│  │  Durable execution, persistence,       │  │
│  │  crash recovery, streaming             │  │
│  │                                        │  │
│  │  ┌──────────────────────────────────┐  │  │
│  │  │            FRAMEWORK             │  │  │
│  │  │  Abstractions, building blocks,  │  │  │
│  │  │  model interfaces, tool schemas  │  │  │
│  │  └──────────────────────────────────┘  │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘
```

The critical distinction: frameworks require assembly; harnesses work out of the box. A framework says "here are the parts, build what you want." A harness says "here's a working agent — customize it."
Part 2: The Seven Architectural Pillars
Every production harness implements these seven subsystems. The quality of each determines whether the agent reliably completes work or produces expensive hallucinations.
Pillar 1: Sandbox & Isolation
What it solves: Agents execute code, modify files, and make network requests. Without isolation, a single hallucinated rm -rf / destroys the host system.
The isolation hierarchy (weakest → strongest):
| Level | Technology | Cold Start | Use Case |
|---|---|---|---|
| Process-level | V8 isolates, Deno | ~1ms | Lightweight scripting |
| Container | Docker, Podman | ~500ms | Standard agent sandboxing |
| Container + kernel hardening | gVisor, seccomp | ~800ms | Defense-in-depth |
| MicroVM | Firecracker, Kata | ~150ms | Maximum isolation (E2B, OpenAI Codex cloud) |
| Full VM | QEMU/KVM | ~2-5s | Legacy, rarely used for agents |
Key architectural decisions:
- Network access: OpenAI Codex disables network by default in cloud tasks. This is the security-first posture. Agents that need network get explicit allowlists.
- Filesystem scope: Each agent task gets its own filesystem overlay (copy-on-write). Changes are ephemeral unless explicitly committed. OpenAI achieves this with per-worktree git checkouts — each Codex run gets an isolated copy of the repo.
- Resource limits: CPU time, memory, disk I/O, and execution duration are capped. Without these, a confused agent can burn infinite compute.
- Teardown: The sandbox and everything in it (logs, metrics, temp files) are destroyed when the task completes. This prevents cross-contamination between runs.
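As a minimal sketch of the resource-limits idea, the snippet below caps CPU time (and optionally address space) for a child process using POSIX rlimits. This is per-process only and purely illustrative — production harnesses enforce limits at the container or microVM layer, where disk I/O and wall-clock duration can also be capped.

```python
import resource
import subprocess
import sys

def run_limited(cmd, cpu_seconds=10, mem_bytes=None):
    """Run a command under hard resource caps (POSIX only).
    A per-process sketch of the 'resource limits' decision above."""
    def set_limits():
        # Hard CPU cap: the kernel kills the process past this budget.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        if mem_bytes is not None:
            # Optional address-space cap.
            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return subprocess.run(
        cmd, preexec_fn=set_limits, capture_output=True,
        text=True, timeout=cpu_seconds * 2,  # wall-clock backstop
    )
```

Usage: `run_limited([sys.executable, "-c", "print('ok')"])` runs a Python one-liner under the caps.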
The emerging pattern: Universal sandbox adapters (e.g., Rivet's sandbox-agent) that provide a single HTTP API in front of any sandbox provider. The agent doesn't know or care whether it's running in Firecracker, Docker, or bare metal — the harness abstracts this entirely.
Pillar 2: Context Engineering
What it solves: The context window is the agent's entire reality. Everything it can't see in-context effectively doesn't exist. Context engineering is the discipline of constructing what the agent sees and when.
Why this is the hardest problem:
Even models with 1M+ token windows degrade well before that limit. Manus discovered a practical "Pre-Rot Threshold" of ~256K tokens — beyond that, performance collapses regardless of advertised capacity. The effective context window is far smaller than the theoretical one.
The four core techniques:
1. Progressive Disclosure (a.k.a. "Give them a map, not a manual")
Don't dump everything into context at once. Structure knowledge in layers:
```
Layer 0: AGENTS.md (~100 lines)
  → "Here's who you are, here's the table of contents, here's where to look"

Layer 1: ARCHITECTURE.md, FRONTEND.md, SECURITY.md
  → Domain-specific guidance, loaded on-demand when the task touches that area

Layer 2: docs/design-docs/, docs/product-specs/
  → Detailed specs, loaded only when actively working on that feature

Layer 3: The code itself
  → Read individual files as needed, never wholesale
```

OpenAI learned this the hard way. They tried a monolithic AGENTS.md and it failed because: (a) it crowded out the actual task, (b) when everything is "important," nothing is, and (c) it rotted instantly. The solution: AGENTS.md as a table of contents, not an encyclopedia.
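The layering can be sketched as a tiny doc selector. The tag-to-path index below is hypothetical — the point is only that Layer 0 always loads and deeper layers load on demand:

```python
# Hypothetical Layer-1 index: task tag -> doc path (names are illustrative).
DOC_INDEX = {
    "frontend": "docs/FRONTEND.md",
    "security": "docs/SECURITY.md",
    "architecture": "ARCHITECTURE.md",
}

def docs_for_task(task_tags):
    """Progressive disclosure: always load the Layer-0 table of contents,
    then pull in only the Layer-1 docs whose tags match the task."""
    selected = ["AGENTS.md"]
    selected += [path for tag, path in DOC_INDEX.items() if tag in task_tags]
    return selected
```

A frontend task would load two files instead of the whole doc tree; Layers 2-3 would follow the same on-demand rule.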
2. Context Compaction
Long-running agents accumulate conversation history that eventually fills the window. Compaction strategies:
- Periodic summarization: Every N turns, summarize the conversation so far and replace the full history with the summary. Anthropic does this automatically.
- Sliding window with anchors: Keep the system prompt + last N messages + any "pinned" messages (key decisions, error states).
- Tool call pruning: Tool inputs/outputs are verbose. After processing, replace them with a summary of what was learned.
3. Initializer Agent Pattern (Anthropic)
Use a different prompt for the first context window versus subsequent ones. The initializer agent acts as a boot sequence — it reads the repository, understands the task, creates a plan file, then hands off to the "worker" agent that actually executes. This is equivalent to BIOS → OS boot → application launch.
4. Attention-Aware Placement
LLMs attend differently to different positions in context. Manus exploits this with their todo.md pattern — the task list is constantly rewritten to stay at the end of the context window, where recency bias keeps it most salient. Instructions at the beginning benefit from primacy effects. The dead zone is the middle.
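A minimal sketch of attention-aware assembly, assuming plain string context parts (the `## todo.md` header is illustrative):

```python
def assemble_context(instructions, history, todo_md):
    """Attention-aware placement: stable instructions first (primacy),
    compacted history in the middle (the dead zone), and the live task
    list last (recency), echoing Manus's todo.md pattern."""
    return "\n\n".join([instructions, *history, "## todo.md\n" + todo_md])
```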
The anti-patterns:
- Monolithic instruction files: Rot instantly, crowd out the task, provide non-guidance.
- Context pollution: Loading irrelevant information "just in case" — every token has an opportunity cost.
- Few-shot self-repetition: Agents that see their own prior outputs start copying patterns rather than reasoning fresh.
- Context debt: Ambiguous, contradictory, or outdated agent instructions that accumulate over time. This is the equivalent of technical debt but for agent configuration.
Pillar 3: Memory Management
What it solves: LLMs are stateless. Every API call starts from zero. Memory gives agents continuity across sessions.
The three memory tiers:
| Tier | Scope | Implementation | Example |
|---|---|---|---|
| Working memory | Current task | The active context window | "I'm currently debugging the login form" |
| Session memory | Current session, survives context compaction | File-based checkpoints (claude-progress.txt, todo.md) | "I've completed steps 1-3, step 4 is in progress" |
| Long-term memory | Cross-session, persistent | Vector databases, LangGraph Store, filesystem | "This user prefers TypeScript, their API key is stored at ~/.config" |
File-based memory dominates in practice:
- Anthropic: `claude-progress.txt` — a structured log enabling session handoff. When a context window fills, the next session reads this file to reconstruct state.
- OpenAI Codex: `AGENTS.md` + `docs/exec-plans/active/` — execution plans with progress and decision logs, versioned in git.
- Manus: `todo.md` — a constantly rewritten task list that serves as both memory and attention anchor.
- OpenClaw: all state stored as Markdown/YAML files on local disk. Memory is literally files.
Why file-based > vector DB for most agent memory:
Vector databases excel at semantic search across large corpora. But most agent memory is structured, recent, and task-specific — closer to a scratchpad than a knowledge base. Files are inspectable, diffable, version-controllable, and trivially readable by the agent. The overhead of embedding, indexing, and querying a vector store is often worse than just reading a well-organized file.
Where vector search does matter: Long-term memory across many sessions (e.g., "what did the user tell me about their preferences 3 weeks ago?"), and retrieval over large codebases (finding the right file in a 10,000-file repo).
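A sketch of file-based session memory, assuming a JSON checkpoint for convenience (the named systems above use free-form text files like `claude-progress.txt` and `todo.md`; the schema here is invented):

```python
import json
from pathlib import Path

def save_checkpoint(path, state):
    """Persist session memory as a plain file the next (fresh) context
    window can read back -- inspectable, diffable, version-controllable."""
    Path(path).write_text(json.dumps(state, indent=2))

def load_checkpoint(path):
    """Reconstruct state, or start fresh if no checkpoint exists yet."""
    p = Path(path)
    if not p.exists():
        return {"steps_done": [], "in_progress": None}
    return json.loads(p.read_text())
```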
Pillar 4: Model Routing
What it solves: No single model is optimal for all tasks. Routing sends easy tasks to cheap models and hard tasks to expensive ones.
The routing hierarchy:
```
User intent → Task classification
                   │
         ┌─────────┼──────────┐
         ▼         ▼          ▼
      Simple     Medium    Complex
     (Haiku)    (Sonnet)    (Opus)
      ~$0.25     ~$3.00    ~$15.00
      per 1M     per 1M     per 1M
```

What a production router handles:
- Task-to-model mapping: "Reformatting this JSON → Haiku. Designing the API architecture → Opus."
- Fallback chains: Primary model times out or rate-limits → try secondary → try tertiary.
- Provider abstraction: Swap between Anthropic, OpenAI, Google without changing agent code. The harness handles auth, format translation, and error normalization.
- Cost accounting: Track spend per task, per agent, per session. Alert when budget thresholds are hit.
The pain point everyone reports: Model routing is awful in practice. Models use different naming conventions, different API formats, different error codes. Auth juggling across providers is a constant source of bugs. This is one of the top four setup pain points cited by practitioners.
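A fallback-chain router can be sketched as follows. The tier names and model names are made up; `call_model` stands in for whatever provider-abstraction function the harness supplies:

```python
# Illustrative tier -> fallback chain mapping; model names are invented.
ROUTES = {
    "simple":  ["cheap-fast-model", "mid-model"],
    "complex": ["frontier-model", "mid-model"],
}

def route(task_class, call_model):
    """Send the task down its tier's fallback chain: if the primary
    model times out or rate-limits, try the next one in line."""
    last_err = None
    for model in ROUTES[task_class]:
        try:
            return model, call_model(model)
        except Exception as err:  # timeout, 429, provider outage...
            last_err = err
    raise RuntimeError(f"all models in the {task_class} chain failed") from last_err
```

Cost accounting and provider-specific error normalization would hang off the same loop, which is exactly where the naming/auth pain described above shows up.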
Pillar 5: Tool Orchestration
What it solves: Agents need to do things — execute code, browse the web, read files, call APIs, query databases. The harness manages tool registration, invocation, permission, and error handling.
MCP (Model Context Protocol) is the emerging standard:
Introduced by Anthropic (Nov 2024), donated to Linux Foundation (Dec 2025), adopted by OpenAI, Google DeepMind, Microsoft. MCP provides:
- A standard JSON-RPC protocol for tool registration and invocation
- Client-server architecture where tool servers expose capabilities
- Schema-based tool descriptions that models can reason about
- Transport-agnostic (stdio, HTTP, WebSocket)
The tool taxonomy:
| Category | Examples | Risk Level |
|---|---|---|
| Read-only | File read, web search, database query | Low |
| Reversible writes | File edit with git, branch creation | Medium |
| Irreversible writes | Email send, database delete, API call with side effects | High (requires human approval) |
| System-level | Process execution, network configuration, credential access | Critical (sandboxed or prohibited) |
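The taxonomy above maps directly onto a risk-gated dispatcher. A minimal sketch, with invented tool names and risk labels mirroring the table:

```python
# Illustrative tool -> risk mapping, mirroring the taxonomy above.
RISK = {"read_file": "low", "edit_on_branch": "medium", "send_email": "high"}

def dispatch(tool, args, registry, approve):
    """Gate tool calls by risk tier: irreversible (high-risk) actions
    require an explicit human-approval callback; anything unregistered
    is treated as critical and refused outright."""
    level = RISK.get(tool, "critical")
    if level == "critical":
        raise PermissionError(f"{tool}: system-level tools are sandboxed or prohibited")
    if level == "high" and not approve(tool, args):
        return {"status": "denied", "tool": tool}
    return {"status": "ok", "result": registry[tool](**args)}
```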
Browser automation is the least solved problem:
Harnesses use Chrome DevTools Protocol (CDP), Playwright, or Puppeteer to give agents browser access. OpenAI wired CDP directly into the Codex runtime for DOM snapshots, screenshots, and navigation. But significant limitations persist — agents can't see browser-native alert modals through Puppeteer, vision-based approaches miss entire categories of visual bugs, and complex multi-step UI flows remain fragile.
Pillar 6: Scheduling & Triggers
What it solves: Agents need to run on schedules (cron), in response to events (webhooks), and in long-running loops.
The dominant pattern for long-running work:
```
┌─────────────────────────────────────────┐
│  OUTER LOOP (harness-managed)           │
│                                         │
│  1. Load/create checkpoint file         │
│  2. Inject into fresh context window    │
│  3. Run agent (5-60 min burst)          │
│  4. Agent writes progress to checkpoint │
│  5. Context window fills or task pauses │
│  6. Harness reads checkpoint            │
│  7. → Go to step 1                      │
│                                         │
│  Repeat until: task complete, budget    │
│  exhausted, or human escalation         │
└─────────────────────────────────────────┘
```

Anthropic's research found that short, focused bursts (~5 minutes) with structured handoff outperform marathon sessions. The Codex team regularly sees single runs work for 6+ hours, but these are structured as many bursts with checkpoint persistence, not one continuous context.
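The outer loop reduces to a few lines once the burst, load, and save callbacks are injected. A sketch, with an assumed `done` flag in the checkpoint state:

```python
def outer_loop(run_burst, load, save, max_bursts=10):
    """Harness-managed outer loop: seed a fresh context from the
    checkpoint, run one bounded agent burst, persist progress, and
    repeat until the task is done or the burst budget is exhausted."""
    for _ in range(max_bursts):
        state = load()                # step 1: load/create checkpoint
        if state.get("done"):
            return state              # task complete
        state = run_burst(state)      # steps 2-3: fresh context, one burst
        save(state)                   # step 4: write progress
    return load()                     # budget exhausted -> escalate to a human
```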
Cron reliability is a major pain point. When a cron job fires while the agent is mid-task, it can corrupt state or override the current context. Solutions:
- Task queuing: Cron jobs add to a queue; the agent processes them when idle.
- Priority interrupts: Critical events preempt current work, with state saved first.
- Heartbeat isolation: Background health checks run in separate agent instances, not in the primary task context.
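The task-queuing solution can be sketched as a queue the cron trigger writes into and the harness drains only between bursts. The class and method names are illustrative:

```python
from collections import deque

class TaskQueue:
    """Cron fires enqueue() instead of interrupting the agent; the
    harness drains the queue only when the agent is idle, so a mid-task
    trigger can never corrupt the active context."""
    def __init__(self):
        self._q = deque()

    def enqueue(self, task, priority=False):
        # Priority interrupts jump the line; ordinary cron work waits.
        (self._q.appendleft if priority else self._q.append)(task)

    def next_when_idle(self, agent_busy):
        return None if (agent_busy or not self._q) else self._q.popleft()
```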
Pillar 7: Observability
What it solves: When an agent runs for 6 hours autonomously, you need to understand what happened, why it made specific decisions, where it went wrong, and how much it cost.
What a harness must track:
- Planning traces: What plan did the agent create? How did it decompose the task?
- Tool call logs: Every tool invocation with inputs, outputs, latency, and success/failure.
- Context window snapshots: What was in context at each decision point?
- Token consumption: Per-turn and cumulative, broken down by model.
- Cost accounting: Real-time spend tracking with budget alerts.
- State transitions: What checkpoints were written? What decisions were logged?
- Handoff points: When did context compaction occur? What was summarized vs. dropped?
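The tool-call-log requirement can be sketched as a wrapper that records inputs, latency, and success/failure for every invocation. The JSON-lines shape is an assumption, not a standard:

```python
import json
import time

def traced(tool_name, fn, log):
    """Wrap a tool so every invocation is logged with its inputs,
    latency, and outcome -- the minimum trace a harness should keep."""
    def wrapper(**kwargs):
        start = time.monotonic()
        try:
            result = fn(**kwargs)
            ok = True
            return result
        except Exception:
            ok = False
            raise
        finally:
            log.append(json.dumps({
                "tool": tool_name,
                "args": kwargs,
                "ms": round((time.monotonic() - start) * 1000, 2),
                "ok": ok,
            }))
    return wrapper
```

Failures are logged before the exception propagates, so the trace survives even when the tool call does not.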
The observability stack (OpenAI's approach):
Each agent worktree gets its own ephemeral observability stack:
- Logs → Victoria Logs (queryable via LogQL)
- Metrics → Victoria Metrics (queryable via PromQL)
- Traces → Victoria Traces (queryable via TraceQL)
- All data → fanned out via Vector
When the task completes, the stack is torn down. This gives each agent run full observability without cross-contamination.
Part 3: What Makes Harnesses Fundamentally Different
The Inversion of Engineering
In traditional software engineering:
- Humans write code → machines execute it
- The engineer's job is implementation
- Tools serve the human
In harness engineering:
- Humans specify intent → agents write and execute code
- The engineer's job is environment design
- The human serves the agent (by making things legible and enforceable)
OpenAI's framing: "The primary job of our engineering team became enabling the agents to do useful work." When something failed, the fix was never "try harder" — it was always "what capability is missing, and how do we make it legible and enforceable for the agent?"
The Legibility Principle
From the agent's point of view, anything it can't access in-context effectively doesn't exist.
This is the single most important principle in harness engineering. It means:
- Slack discussions that aligned the team on an architecture → invisible to agents
- Google Docs containing product specs → invisible to agents
- Tacit knowledge in engineers' heads → invisible to agents
- Knowledge must be pushed into the repo as versioned, structured, agent-readable artifacts
This is why OpenAI treats the repository as the "system of record" — not Notion, not Confluence, not Slack. The repo is the only thing the agent can see.
Mechanical Enforcement Over Documentation
Humans can read a style guide and exercise judgment. Agents pattern-match against what they see in the codebase, including bad patterns. Documentation alone doesn't prevent drift.
The hierarchy of enforcement:
```
Weakest ←──────────────────────────────────→ Strongest

README    AGENTS.md       Linter      CI gate       Type system
advice    instructions    warning     that blocks   that won't
                                      merge         compile
```

OpenAI's rule: "When documentation falls short, we promote the rule into code." Custom linters enforce naming conventions, dependency directions, file size limits, and structured logging. The linter error messages are written specifically to inject remediation instructions into agent context — so when the agent hits a lint failure, the error tells it exactly how to fix it.
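A minimal sketch of such a rule: a file-size check whose error message doubles as remediation instructions for the agent. The 400-line cap and the wording are illustrative:

```python
def lint_file_sizes(files, max_lines=400):
    """Mechanical-enforcement sketch: a custom lint rule whose error
    message is written as remediation instructions, so a failing agent
    is told exactly how to fix the violation."""
    errors = []
    for path, text in files.items():
        n = text.count("\n") + 1
        if n > max_lines:
            errors.append(
                f"{path}: {n} lines exceeds the {max_lines}-line limit. "
                "Fix: split this module by responsibility into smaller "
                "files and re-export the public surface from one entry point."
            )
    return errors
```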
Entropy is Constant
Agents replicate patterns that already exist in the codebase — including bad ones. Over time, this leads to architectural drift, style inconsistency, and accumulated "AI slop."
The garbage collection pattern:
- Define "golden principles": Opinionated, mechanical rules (prefer shared utilities over hand-rolled helpers; validate boundaries, don't probe YOLO-style).
- Automated scanning: Background agents run on a regular cadence, scanning for deviations from golden principles.
- Targeted refactoring PRs: Each deviation generates a small, focused PR that can be reviewed in <1 minute and auto-merged.
- Quality grades: Each product domain and architectural layer gets a quality score, tracked over time.
OpenAI's team went from spending 20% of their week (every Friday) on manual cleanup to fully automated entropy management. The key insight: "Technical debt is like a high-interest loan — it's almost always better to pay it down continuously in small increments than to let it compound."
Part 4: How Harnesses Are Built (Implementation Patterns)
The Agent Loop
Every harness implements some variant of this core loop:
```
┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────┐     ┌──────────┐                 │
│   │ Receive  │────▶│ Construct│                 │
│   │ Task     │     │ Context  │                 │
│   └──────────┘     └────┬─────┘                 │
│                         │                       │
│                         ▼                       │
│                   ┌──────────┐                  │
│                   │ LLM Call │◀──────────┐      │
│                   └────┬─────┘           │      │
│                        │                 │      │
│               ┌────────┴────────┐        │      │
│               ▼                 ▼        │      │
│        ┌──────────┐      ┌──────────┐    │      │
│        │ Text     │      │ Tool     │    │      │
│        │ Response │      │ Call     │    │      │
│        └────┬─────┘      └────┬─────┘    │      │
│             │                 │          │      │
│             │                 ▼          │      │
│             │          ┌──────────┐      │      │
│             │          │ Execute  │      │      │
│             │          │ in       │      │      │
│             │          │ Sandbox  │      │      │
│             │          └────┬─────┘      │      │
│             │               │            │      │
│             │               ▼            │      │
│             │          ┌──────────┐      │      │
│             │          │ Append   │──────┘      │
│             │          │ Result   │             │
│             │          │ to       │             │
│             │          │ Context  │             │
│             ▼          └──────────┘             │
│      ┌──────────┐                               │
│      │ Output / │                               │
│      │ Commit   │                               │
│      └──────────┘                               │
│                                                 │
└─────────────────────────────────────────────────┘
```

The harness owns every box except "LLM Call," which is the model provider's responsibility. Everything else — context construction, tool execution, sandboxing, result processing, output delivery — is harness territory.
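The loop above can be sketched in a few lines. The message shapes (`tool`/`args`/`content` keys) are assumptions for illustration, not any provider's actual API:

```python
def agent_loop(llm, tools, task, max_turns=20):
    """Minimal harness loop sketch: call the model, execute any tool
    call (the sandboxed step in the diagram), append the result to
    context, and repeat until the model returns plain text."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        msg = llm(context)
        context.append(msg)
        if "tool" not in msg:
            return msg["content"]           # text response -> output/commit
        result = tools[msg["tool"]](**msg.get("args", {}))
        context.append({"role": "tool", "content": str(result)})
    raise RuntimeError("turn budget exhausted")
```

The `max_turns` cap is the harness's resource limit showing up again: without it, a confused model can loop forever.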
The AGENTS.md Convention
AGENTS.md is the equivalent of .gitignore or Dockerfile — a convention file that harnesses look for at the root of a repository to understand how to operate.
What research found in the wild (13-16 instruction categories):
- Functional directives dominate (coding standards, tool usage, response format)
- Non-functional requirements (security, performance, accessibility) are systematically underrepresented
- Most files score "Difficult" to "Very difficult" readability
- The best files are short (~100 lines), act as a table of contents, and point to deeper docs
The progressive disclosure structure that works:
```
AGENTS.md (100 lines)
 ├── Points to → ARCHITECTURE.md (system map)
 ├── Points to → docs/FRONTEND.md (UI conventions)
 ├── Points to → docs/SECURITY.md (security rules)
 ├── Points to → docs/PRODUCT_SENSE.md (product principles)
 ├── Points to → docs/exec-plans/active/ (current work)
 └── Points to → docs/design-docs/index.md (design history)
```

The Agent-to-Agent Review Loop
OpenAI's most advanced pattern: no human review required for most PRs.
```
Engineer writes prompt
        │
        ▼
Codex generates code + tests
        │
        ▼
Codex reviews its own changes (local)
        │
        ▼
Codex requests additional agent reviews (cloud)
        │
        ▼
Responds to agent feedback, iterates
        │
        ▼
All agent reviewers satisfied?
    │         │
    No        Yes
    │         │
    ▼         ▼
 Iterate   Auto-merge
```

This is what they call a "Ralph Wiggum Loop" — the agent keeps iterating with itself and other agents until the review passes. Humans review only when judgment is required (product direction, architectural decisions, user-facing trade-offs).
The Layered Architecture Enforcement Pattern
To prevent agents from creating spaghetti dependencies:
```
Within each business domain:

Types → Config → Repo → Service → Runtime → UI

Rules:
- Code can only depend "forward" (Types → Config is OK, Config → Types is not)
- Cross-cutting concerns (auth, telemetry, feature flags) enter through
  a single explicit interface: Providers
- Dependency direction is validated by custom linters
- Linter errors include remediation instructions in the error message
```

This is "the kind of architecture you usually postpone until you have hundreds of engineers." With coding agents, it's an early prerequisite — the constraints are what allow speed without decay.
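The forward-only rule is mechanically checkable. A sketch, with the remediation message invented in the spirit of the linter rule above:

```python
# Layer order within a business domain, per the rules above.
LAYERS = ["Types", "Config", "Repo", "Service", "Runtime", "UI"]

def check_dependency(src, dst):
    """Forward-only dependency check: an edge is legal only if it points
    forward (or stays within its layer). Returns None when legal, else a
    remediation message an agent can act on directly."""
    if LAYERS.index(dst) >= LAYERS.index(src):
        return None
    return (f"Illegal dependency {src} -> {dst}. Fix: invert the edge, or "
            "route the shared concern through the Providers interface.")
```

A real linter would extract `src`/`dst` from import statements; the check itself stays this simple.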
Part 5: The Current Landscape (Who Built What)
The Five Major Harnesses
| Harness | Owner | Strategy | Differentiator |
|---|---|---|---|
| Codex | OpenAI | Model-tied (GPT-5.x), CLI + cloud | Deepest harness engineering philosophy; AGENTS.md convention; GPT-5.3-Codex model |
| Claude Code / Agent SDK | Anthropic | Model-tied (Claude), SDK + sub-agents | Context compaction research; initializer agent pattern; skills system |
| Manus | Meta (acquired $2-3B) | Model-agnostic wrapper | Won benchmarks with zero proprietary models; pure harness quality; rewrote 5x in 6 months |
| DeepAgents | LangChain | Model-agnostic, open-source | Only major general-purpose model-agnostic harness; built on LangGraph |
| OpenClaw | Open-source (community) | Local-first, 12+ platforms | 145K+ GitHub stars; Markdown/YAML state; massive adoption; significant security concerns |
The Key Insight from the Landscape
Manus proved the thesis. They competed on labor benchmarks against model companies without training or fine-tuning any models. They used off-the-shelf Claude and competed purely on harness quality — context engineering, tool orchestration, execution reliability. Meta paid $2-3B for this capability.
This means: the harness is where identical models produce dramatically different outcomes. The model is necessary but not sufficient. The infrastructure wrapping the model determines whether you get reliable work or expensive hallucinations.
Part 6: The Unsolved Problems
The 70% Problem
With almost no effort, you can get 70% of the way to a working agent. The remaining 30% requires an order-of-magnitude more investment in harness engineering. This is where most projects stall.
The Five Hardest Problems (Ranked)
1. Constant Harness Rewrites Each new model release changes the optimal agent architecture. Manus rewrote 5x in 6 months. Vercel removed 80% of their agent's tools and got better results. Over-engineered control flow becomes a liability when the underlying model gets smarter. The "Bitter Lesson" applies: the best harness builders practice continuous simplification.
2. Context Engineering at Scale Context debt accumulates silently. Outdated instructions, contradictory guidance, and verbose tool outputs degrade performance over time. There is no "context linter" equivalent — maintaining context quality requires active gardening.
3. Security OWASP rates prompt injection as the #1 LLM risk with 84%+ failure rate against prompt-only defenses. Agents with broad system access (file system, browser, network) create attack surfaces that traditional security models weren't designed for. The OpenClaw ecosystem demonstrates this tension — massive adoption with inadequate security vetting.
4. Long-Run Reliability Current benchmarks test behavior over short bursts. Nobody benchmarks what happens at the 100th tool call, or after 4 hours of autonomous operation. Models that are smart enough to solve a hard puzzle in two tries may fail to follow initial instructions after running for an hour.
5. Browser Automation Complex multi-step UI flows remain fragile. Vision limitations mean agents miss visual bugs. Alert modals, drag-and-drop, and dynamic content create edge cases that break regularly.
Part 7: What the Future of Harnesses Looks Like
Property 1: Harnesses Will Continuously Simplify
As models get smarter, harnesses get thinner. Every rigid assumption about agent control flow gets invalidated by the next model release. Anthropic reportedly strips out harness complexity as Claude improves. The winning strategy is minimal viable scaffolding — just enough structure to keep the agent safe and on-track, not so much that it fights the model's own capabilities.
Property 2: Context Engineering Becomes a First-Class Discipline
In 2023, "prompt engineering" was the hot skill. In 2026, it's "context engineering." This is about designing what information enters the context window, when, in what order, at what granularity, and how it's refreshed. It has its own patterns (progressive disclosure, initializer agents, pre-rot thresholds, attention-aware placement) and anti-patterns (monolithic files, context pollution, few-shot self-repetition). This is an engineering discipline, not an art.
Property 3: Harnesses Become Part of the Model Training Loop
GPT-5.3-Codex was "instrumental in creating itself." The harness generates trajectory data (what the agent tried, what worked, what failed) that feeds back into model fine-tuning. This creates a flywheel: better harness → better trajectory data → better model → simpler harness needed → repeat. Builders who capture this data will define what agents can do next.
Property 4: Security-by-Default Replaces Permission-Grant
The current model: "the agent can do anything unless we restrict it." The future model: "the agent can do nothing unless we allow it." Network disabled by default. Filesystem restricted to a sandbox. Every irreversible action requires explicit human approval. The harness operates as a firewall, not a gateway.
Property 5: Vertical Specialization Wins Over Horizontal Generality
Generic harnesses will be commoditized by cloud providers (AWS, Azure, GCP) within 12-18 months. The durable value is in domain-specific harnesses that embed workflow knowledge, regulatory constraints, and proprietary data for specific industries — legal, healthcare, finance, enterprise IT. These create switching costs that horizontal harnesses cannot.
Property 6: Multi-Agent Coordination Becomes Native
Current harnesses are mostly single-agent. The next generation will natively support:
- Agent-to-agent review: One agent writes, another reviews (OpenAI already does this)
- Hierarchical delegation: A planning agent decomposes work and assigns sub-agents
- Shared memory: Multiple agents read/write to a common state store
- Conflict resolution: When two agents make contradictory changes to the same codebase
Property 7: Observability Becomes the Moat
The harness that best understands why agents succeed or fail — through detailed traces, cost analysis, and quality metrics — becomes the one that can improve fastest. LangSmith's strategy (own the quality measurement layer regardless of which framework wins) is directionally correct. You can't improve what you can't measure.
Appendix: The Properties Checklist
Use this to evaluate any harness:
| Property | Question | Red Flag |
|---|---|---|
| Isolation | Can a rogue agent damage the host system? | No sandboxing, or sandbox with full network access |
| Context management | How does it handle context overflow? | No compaction strategy; monolithic instruction files |
| Memory | Does state persist across sessions? | Stateless between runs; no checkpoint mechanism |
| Model routing | Can it use different models for different tasks? | Hardcoded to a single model/provider |
| Tool safety | Are irreversible actions gated on human approval? | All tool calls execute without review |
| Scheduling | Can it run on cron without corrupting active work? | Cron fires into active context; no task queuing |
| Observability | Can you reconstruct why the agent made a specific decision? | Black box execution; no trace logging |
| Error recovery | What happens when a tool call fails mid-task? | Agent halts; no retry or checkpoint recovery |
| Entropy management | How is code/artifact quality maintained over time? | No automated scanning; relies on manual review |
| Security | What's the blast radius of prompt injection? | Full system access; no permission boundaries |