
The Consul Harness

A Runtime for Autonomous Executive Intelligence


The Problem

Today's AI assistants are stateless, reckless, and blind.

Stateless. Every conversation starts from zero. Tell the assistant you prefer morning meetings on Tuesday. By Thursday, it's forgotten. The context window is the only reality it has, and it evaporates the moment the session ends. An executive assistant that can't remember your preferences isn't an assistant — it's a stranger you have to re-brief every morning.

Reckless. The assistant has direct access to your email, calendar, documents, and messaging. It can send an email to your board, delete a thread from your investor, double-book your afternoon, or text your client at 2am. There is no gate between "the model decided to do this" and "it happened." In a coding agent, mistakes get caught by tests. In an executive assistant, mistakes get read by people. You can git revert bad code. You cannot unsend an email your CEO already opened.

Blind. When the assistant makes a decision — why it chose this meeting time, why it drafted that response, why it flagged this email as urgent — there is no record. No trace. No reasoning chain you can inspect. When something goes wrong (and it will), you have no forensics. When something goes right, you have no way to understand why so you can make it happen again.

These three problems share a root cause: the model is running naked. There is no operating layer between the intelligence and the world it acts on. No memory system. No safety boundary. No observation deck.

That operating layer is the harness.


What the Consul Harness Is

The harness is the runtime environment that wraps the intelligence layer to make it durable, safe, and legible.

It is not the agents. It is not the tools. It is not the workflows. Those already exist in Consul's architecture — orchestrators that route requests, domain agents that operate on Gmail, Calendar, Docs, Drive, Contacts, and Messaging, tools that execute API calls, and workflows that chain multi-step processes.

The harness is the infrastructure between the user and those agents. It intercepts every request on the way in, enriches it with context and memory, and intercepts every action on the way out, gating it through safety rules and recording it for observability. The agents never know the harness exists. They just receive better context and operate within enforced boundaries.

Think of it this way: the agents are the hands. The tools are what the hands can touch. The harness is the nervous system — it decides what information reaches the brain, what the hands are allowed to do, and what gets recorded for learning.


The Six Properties

The Consul Harness is defined by six properties. Each addresses a fundamental requirement for autonomous executive work.

1. Contextual Awareness

The principle: From the agent's point of view, anything it cannot see in its context window does not exist.

When a user says "schedule a meeting with Sarah about the Q3 budget," the raw request contains almost none of the information needed to act on it. Which Sarah? What time zone? How long? What calendar? What's the user's scheduling preference? Has there been a prior budget conversation? Is there a conflict?

Without the harness, the agent asks five clarifying questions — turning a one-sentence request into a tedious back-and-forth that makes the user feel like they're managing the assistant instead of the other way around.

With the harness, the request arrives at the agent already enriched. The harness has identified Sarah Chen as the user's CFO from contact context. It knows the user prefers morning meetings for financial topics. It's pre-fetched today's calendar and found a conflict at 2pm. It's retrieved the last budget discussion from January and noted there's a shared Drive document. It knows the user is in Pacific time and Sarah is Eastern.

The agent receives all of this as context. It acts decisively. One message back: "I've booked 30 minutes with Sarah Chen tomorrow at 10am your time, 1pm hers. I've attached the Q3 budget doc from your last discussion. Want me to send the invite?"

How this works architecturally: The harness constructs context in layers, loaded progressively based on what the task needs. The first layer is always present — who the user is, their preferences, the current date, what background tasks are running. The second layer loads domain-specific summaries based on intent — if the request is about email, load the email summary; if calendar, load the schedule. The third layer loads full content on demand — the actual email thread, the full calendar day, the document body. The fourth layer retrieves from long-term memory via search — past conversations, relationship history, topic context.

This is progressive disclosure applied to agent context. Give the agent a map, not a manual. Load what it needs when it needs it, and nothing more.
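The layered loading described above can be sketched as a small loader that assembles context cheapest-first. Everything here is an illustrative assumption, not Consul's actual API: the function name, the field names, and the shape of the memory store are all invented for the sketch.

```python
# Sketch of progressive context loading. All names and structures are
# illustrative assumptions, not Consul's real interfaces.

def build_context(request: dict, user: dict, memory: dict) -> dict:
    """Assemble agent context in layers, loading more only when needed."""
    # Layer 1: always present -- identity, preferences, date, running tasks.
    context = {
        "user": user["name"],
        "preferences": user["preferences"],
        "date": request["date"],
        "background_tasks": user.get("background_tasks", []),
    }
    # Layer 2: a domain summary chosen by classified intent.
    intent = request["intent"]  # e.g. "calendar", "email", "docs"
    if intent in ("calendar", "email", "docs"):
        context["summary"] = memory["summaries"].get(intent, "")
    # Layer 3: full content only when the task demands it.
    if request.get("needs_full_content"):
        context["content"] = memory["content"].get(intent, "")
    # Layer 4: long-term memory retrieved by (here, trivial) search.
    hits = [m for m in memory["long_term"] if request["topic"] in m]
    context["retrieved"] = hits[:3]  # cap what reaches the window
    return context
```

The point of the sketch is the ordering: each layer is strictly more expensive than the last, and the loader stops as early as the task allows.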

2. Persistent Memory

The principle: An executive assistant that forgets is not an assistant. It's a temp.

Memory operates at three timescales, each serving a different purpose.

Preference memory is the slowest-changing and most valuable. It accumulates over weeks and months as the user interacts with the system. "I don't take meetings before 10am." "Always cc Maria on scheduling emails." "Use formal tone with clients, casual with the team." "John at Acme is a VIP — never auto-send to him." These preferences are rarely stated explicitly. More often, they're inferred from behavior — the user rejects three morning meeting suggestions in a row, and the harness learns to stop proposing them. Preference memory is loaded into every single interaction. It's the harness's understanding of who the user is.

Session memory is task-scoped and ephemeral. When the user starts a scheduling negotiation that spans multiple messages — "find a time with Sarah" → "not Tuesday, I have a dentist appointment" → "how about Thursday?" — the harness maintains a checkpoint of where the conversation stands. What's been tried. What's been rejected. What's pending. If the user comes back four hours later and says "did Sarah confirm?", the harness reconstructs the full context from the checkpoint without the user having to re-explain anything. Session memory is what makes multi-turn tasks feel continuous rather than fragmented.

Relationship memory is about people, not tasks. Every time the agent interacts with or learns about a person in the user's world, that knowledge accumulates. Sarah Chen prefers Zoom over Google Meet. The Acme team is three hours ahead. David from legal takes 48 hours to respond to emails. This context is retrieved whenever a person is mentioned, enriching the agent's understanding of the social and professional landscape the user operates in.

The critical design principle: memory must be inspectable and editable by the user. The user should be able to see what the harness remembers, correct it, and delete it. "You think I prefer morning meetings, but actually I changed that — I'm a night owl now." Opaque memory breeds distrust. Transparent memory builds partnership.
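As a sketch of what inspectable, editable memory might look like, here is a minimal store; the class and method names are hypothetical, not Consul's implementation.

```python
class PreferenceMemory:
    """User-inspectable, user-editable preference store (illustrative sketch)."""

    def __init__(self):
        self._prefs: dict[str, str] = {}

    def learn(self, key: str, value: str) -> None:
        # Preferences accumulate, whether stated or inferred from behavior.
        self._prefs[key] = value

    def inspect(self) -> dict[str, str]:
        # The user can always see everything the harness believes.
        return dict(self._prefs)

    def correct(self, key: str, value: str) -> None:
        # "You think I prefer mornings, but I'm a night owl now."
        self._prefs[key] = value

    def forget(self, key: str) -> None:
        self._prefs.pop(key, None)
```

The design choice that matters is that `inspect`, `correct`, and `forget` are first-class operations, not administrative backdoors.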

3. Safety Boundaries

The principle: The blast radius of any mistake must be bounded by design, not by luck.

Every action the agent can take falls into one of four tiers, defined by the severity and reversibility of its consequences.

Tier zero actions are read-only. Search emails. Check the calendar. Look up a contact. List files. These execute immediately, silently, without any gate. They have zero side effects and zero risk.

Tier one actions are low-risk writes with limited blast radius. Apply an email label. Create a calendar event on the user's own calendar. Organize files within existing folders. Send a message in an active conversation where the user has already been participating. These execute with a brief notification — the user sees what happened but doesn't need to approve it in advance. They can be undone.

Tier two actions are consequential and potentially irreversible. Send an email to someone new. Send a calendar invite to an external participant. Share a document. Send a message to someone the user hasn't messaged recently. Draft an email that contains financial figures, commitments, or legal language. These block until the user explicitly approves. The harness presents the proposed action, waits for confirmation, and only then executes.

Tier three actions are prohibited entirely. Bulk deletions. Forwarding sensitive documents externally. Sending messages from the user's phone to unknown numbers. Account-level changes. The agent cannot perform these regardless of what it's asked, and escalates to the user with an explanation of why.

The tier system is conservative by default and loosens over time. When a user first connects their Gmail, every outbound email is tier two — requires approval. As the harness observes the user approving certain patterns ("yes, always send scheduling confirmations to people in my contacts"), those specific patterns graduate to tier one. Trust is earned incrementally. It is never assumed.

Underneath the tier system sits the audit trail. Every action — whether auto-executed, approved, or blocked — is recorded with full context: what the agent intended to do, why it decided to do it, what information it had at the time, what the outcome was. This is not logging for debugging. This is a complete forensic record that answers the question "why did my assistant do that?" for any stakeholder at any time.
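The tier system and the audit trail can be sketched together as a single gate function. The action names and classification rules below are invented for illustration; a real gate would classify on much richer signals.

```python
from enum import IntEnum

class Tier(IntEnum):
    READ_ONLY = 0      # execute silently
    LOW_RISK = 1       # execute, then notify
    CONSEQUENTIAL = 2  # block until the user approves
    PROHIBITED = 3     # never execute; escalate with an explanation

# Hypothetical classification rules for the sketch.
RULES = {
    "search_email": Tier.READ_ONLY,
    "apply_label": Tier.LOW_RISK,
    "send_email_new_contact": Tier.CONSEQUENTIAL,
    "bulk_delete": Tier.PROHIBITED,
}

def gate(action: str, approved: bool, audit: list) -> str:
    """Classify an action, enforce its tier, and record it to the audit trail."""
    tier = RULES.get(action, Tier.CONSEQUENTIAL)  # unknown actions default upward
    if tier is Tier.PROHIBITED:
        outcome = "blocked"
    elif tier is Tier.CONSEQUENTIAL and not approved:
        outcome = "awaiting_approval"
    else:
        outcome = "executed"
    # Every path through the gate is recorded, including refusals.
    audit.append({"action": action, "tier": int(tier), "outcome": outcome})
    return outcome
```

Note the conservative default: an action the gate has never seen is treated as tier two, mirroring the "trust is earned, never assumed" posture described above.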

4. Durable Execution

The principle: Executive work happens on human timescales — hours, days, weeks — not millisecond API calls.

A scheduling negotiation might span five days. Send availability Monday. Get a response Wednesday. Counter-propose Thursday. Confirm Friday. Each step involves a different context, different information, and potentially a different trigger (email response, calendar change, timer expiration).

No context window can stay open for five days. The harness solves this by treating long-running tasks as a series of short bursts with durable state in between. The agent works for a few minutes, writes a structured checkpoint capturing everything it knows and everything it's waiting for, and sleeps. When a trigger fires — an email arrives, a calendar event changes, a timer expires — the harness wakes the agent, reconstructs context from the checkpoint plus new information, and the agent continues where it left off.
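The checkpoint-and-wake cycle might look like the following sketch, assuming a simple key-value store for durable state; the function names and state shape are illustrative.

```python
import json

def checkpoint(task_id: str, state: dict, store: dict) -> None:
    """Persist everything the agent knows and is waiting for, then sleep."""
    store[task_id] = json.dumps(state)

def wake(task_id: str, trigger: dict, store: dict) -> dict:
    """Reconstruct context from the checkpoint plus the new trigger."""
    state = json.loads(store[task_id])
    state["events"].append(trigger)  # fold in what just happened
    return state
```

The agent's five-day negotiation becomes a sequence of short bursts: work, `checkpoint`, sleep, `wake` on a trigger, continue.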

This requires solving three scheduling problems specific to executive work.

Task isolation. The daily briefing that fires at 8am must not interfere with the scheduling conversation the user is actively having. Background tasks and foreground tasks run in separate execution contexts. They can read shared state (the user's calendar, email) but they cannot write to each other's context.

Priority interrupts. An email from the CEO arriving during background file organization should preempt the file work, not wait in a queue. The harness maintains a priority system: user-initiated requests outrank background tasks, and certain senders or keywords trigger immediate escalation regardless of what else is running.
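The priority scheme can be sketched with a standard heap-backed queue; the three priority tiers here are assumptions for illustration.

```python
import heapq

# Lower number = higher priority (illustrative tiers).
PRIORITY = {"vip_escalation": 0, "user_request": 1, "background": 2}

class TaskQueue:
    """Priority queue where user work preempts background work."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tiebreaker preserves FIFO order within a priority

    def push(self, kind: str, task: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], self._seq, task))
        self._seq += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]
```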

Graceful degradation. When a task fails mid-execution — an API times out, a model returns garbage, a rate limit is hit — the harness saves state, retries with exponential backoff, and escalates to the user only when automated recovery is exhausted. The user should never see "something went wrong, please try again." They should see "I wasn't able to reach Google Calendar, but I've saved the meeting details and will book it as soon as the connection is restored."
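A minimal sketch of this recovery loop, assuming exponential backoff over a retryable connection error; the function name and escalation message are illustrative.

```python
import time

def run_with_recovery(step, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a failing step with exponential backoff; escalate only when exhausted."""
    for attempt in range(max_retries):
        try:
            return step()
        except ConnectionError:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s ...
    # Automated recovery exhausted: surface a specific, actionable message,
    # never a generic "something went wrong".
    return "escalate: saved the details; will retry when the connection is restored"
```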

5. Economic Intelligence

The principle: Not every task deserves a frontier model. Not every user gets unlimited spend.

Applying an email label requires pattern matching. Drafting a nuanced response to a sensitive client email requires sophisticated reasoning. Routing intent from the orchestrator to the right domain agent requires classification. Negotiating a multi-party scheduling conflict requires planning and social awareness.

The harness routes each task to the appropriate model tier — fast and cheap for simple operations, mid-tier for standard work, frontier for complex reasoning. This isn't about being stingy. It's about being sustainable. An executive assistant that costs $50/day in API calls is a product with negative unit economics. One that costs $3/day because it intelligently routes 80% of tasks to cheap models is a business.

The harness also tracks cost at every level — per task, per conversation, per user, per day. It enforces budgets (daily and monthly ceilings), alerts when spend patterns change, and provides the data needed to optimize routing rules over time. When the model landscape shifts — a new cheap model becomes available, a frontier model drops its price — the routing table updates without touching the agents.
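A routing table with budget enforcement might be sketched like this; the tiers, task names, and per-task costs are invented placeholders, not real provider prices.

```python
# Hypothetical routing table: task -> (model tier, assumed cost in dollars).
ROUTES = {
    "apply_label": ("fast", 0.001),
    "route_intent": ("fast", 0.001),
    "draft_reply": ("mid", 0.01),
    "negotiate_schedule": ("frontier", 0.10),
}

def route(task: str, spend_today: float, daily_budget: float = 3.0):
    """Pick a model tier for the task and enforce the daily budget ceiling."""
    tier, cost = ROUTES.get(task, ("mid", 0.01))
    if spend_today + cost > daily_budget:
        return "budget_exceeded", spend_today
    return tier, spend_today + cost
```

Because routing is a table, not agent logic, a price drop or a new cheap model is a one-line change that never touches the agents.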

6. Full Observability

The principle: If you can't explain why the agent did something, you can't trust it, improve it, or defend it.

Every interaction produces a trace. The trace records: what context was loaded (and what wasn't), which model was selected (and why), what the agent decided (and its reasoning), what tools were called (and their results), what safety tier was applied (and whether the user approved), how much it cost (in tokens and dollars), and how long it took (end to end and per step).
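The trace fields listed above might be captured in a record like this sketch; the field names are assumptions, not Consul's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One interaction's forensic record (illustrative field names)."""
    context_loaded: list          # what reached the window (and, by omission, what didn't)
    model: str                    # which tier was selected
    model_reason: str             # why it was selected
    decision: str                 # what the agent decided, with reasoning
    tool_calls: list = field(default_factory=list)  # calls and their results
    safety_tier: int = 0          # tier applied to the action
    user_approved: bool = False   # whether approval was required and given
    cost_usd: float = 0.0         # tokens and dollars
    latency_ms: int = 0           # end to end
```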

This observability serves three audiences.

The user needs to understand what happened. "Why did you book at 10am instead of 2pm?" The trace shows: 2pm had a conflict with the user's existing meeting, and the user's preference memory indicates a preference for mornings. Clear. Defensible. Inspectable.

The operator needs to improve the system. Which agents are slowest? Which tool calls fail most often? Which safety tier gets the most overrides? Where is cost concentrated? Observability data feeds directly into harness optimization — adjusting routing rules, refining safety tiers, improving context loading.

The enterprise buyer needs compliance and auditability. Every action the agent took, with full provenance, is available for review. When the question is "did the AI access this person's email?" the answer is in the audit trail, not in speculation.


Why an Executive Assistant Harness Is Different

The harness concept applies broadly to all agent types — coding agents, research agents, data agents. But an executive assistant harness has a unique constraint that shapes everything: every mistake has social consequences.

A coding agent that writes bad code gets caught by tests. The code never ships. The blast radius is zero. A research agent that returns a wrong fact gets corrected in the next prompt. The document gets revised. The consequence is time lost.

An executive assistant that sends the wrong email — that email has been read. An assistant that books a meeting over someone's lunch — that person is already annoyed. An assistant that sends a message to the wrong contact — that message has already been delivered. There is no compiler. There is no test suite. There is no staging environment for human relationships.

This means:

Default to draft mode. The harness should prepare every outbound action as a draft first and present it for review. Only after the user has explicitly opted into auto-execution for specific patterns — "yes, always auto-send scheduling confirmations to people in my contacts" — should the harness execute without confirmation. The trust gradient moves from "show me everything" to "handle it" over weeks, not minutes.

Optimize for the cost of errors, not the cost of delays. In a coding agent, speed matters and errors are cheap. In an executive assistant, precision matters and errors are expensive. A 5-second delay to check the safety tier is invisible to the user. A wrong email to the board is unforgettable.

Make rollback a first-class operation. Where possible, every action should have a corresponding undo. Unsend a draft. Cancel an event. Retract a message. Archive instead of delete. The harness should prefer reversible actions over irreversible ones, even when the agent's first instinct is to act directly.

Build trust through transparency, not through perfection. The agent will make mistakes. The question is whether the user can see what happened, understand why, and correct the system so it doesn't happen again. A transparent harness that makes occasional mistakes and learns from them earns more trust than an opaque one that's usually right but inexplicable when it's wrong.
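The preference for reversible actions can be sketched as a lookup that pairs each action with its undo, swapping in a reversible near-equivalent where one exists; all action names here are hypothetical.

```python
# Illustrative pairing of actions with their undo operations.
UNDO = {
    "send_draft": "unsend_draft",
    "create_event": "cancel_event",
    "send_message": "retract_message",
    "archive_thread": "unarchive_thread",
    "delete_thread": None,  # irreversible: prefer archiving instead
}

def prefer_reversible(action: str) -> str:
    """Swap an irreversible action for a reversible near-equivalent when one exists."""
    if action == "delete_thread":
        return "archive_thread"  # same intent, nonzero undo path
    return action
```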


The Consul Harness Model

Putting it all together, the Consul Harness operates as a six-layer intercept between the user and the intelligence:

Inbound path (user → agents):

  1. Context Engine enriches the request with identity, preferences, domain summaries, relationship context, and retrieved memories — layered progressively based on what the task needs.
  2. Memory Manager loads any active session state and relevant long-term memories, giving the agent continuity across conversations.
  3. Model Router selects the appropriate model tier and tracks cost, ensuring economic sustainability.

Outbound path (agents → world):

  4. Safety Gate classifies every proposed action into a tier, gates consequential actions on user approval, and records everything to the audit trail.
  5. Scheduler manages task durability — checkpointing long-running work, isolating background tasks from foreground conversations, and handling priority interrupts.
  6. Observability Layer records the complete trace of every interaction — context, reasoning, actions, costs — for user inspection, system improvement, and compliance.

The agents themselves — the orchestrators, the domain agents, the tools, the workflows — remain unchanged. They don't know the harness exists. They receive richer context on the way in and operate within enforced boundaries on the way out. The harness makes them better without modifying them.


The Build Sequence

The harness doesn't arrive fully formed. It's built in phases, each delivering standalone value while enabling the next.

Phase 1: Safety. Before anything else, bound the blast radius. The tier system, approval gates, and audit trail are the foundation. Without them, the agent is a liability. With them, it's a product you can put in front of real users.

Phase 2: Memory. Give the agent continuity. Preference memory makes it personal. Session memory makes it coherent across conversations. Together, they transform the experience from "talking to a stranger" to "working with someone who knows me."

Phase 3: Context. Make the agent proactive. Layered context loading means the agent already knows which Sarah, already knows the schedule conflict, already has the relevant document. This is the phase where the assistant starts to feel intelligent rather than merely capable.

Phase 4: Observability. Make the system legible. Traces, cost tracking, and decision logs give you the data to debug, improve, and build trust with users and enterprise buyers.

Phase 5: Economics. Make it sustainable. Model routing and cost management mean the system doesn't burn through API budget on trivial tasks. This is what turns a prototype into a viable product.

Phase 6: Durability. Make it autonomous. Durable execution, checkpointing, and long-running task management are what enable the assistant to handle multi-day workflows — scheduling negotiations, follow-up chains, project coordination — without losing state.

Each phase compounds on the previous ones. Safety enables real users. Memory makes those users come back. Context makes them rely on the agent. Observability makes the system improvable. Economics makes it scalable. Durability makes it truly autonomous.


The Endgame

The fully realized Consul Harness produces an executive assistant that is:

Reliable — it will not send an email you didn't approve, book a meeting you can't attend, or delete a thread you need. Its boundaries are enforced mechanically, not hoped for.

Continuous — it remembers your preferences, your relationships, your active tasks, and your history. It picks up where it left off. It never asks you to repeat yourself.

Intelligent — it arrives at every task with the right context already loaded. It knows who Sarah is, what your schedule looks like, and what you discussed last time. It acts decisively because it has what it needs.

Transparent — every action it takes is recorded, inspectable, and explainable. You can always ask "why did you do that?" and get a real answer.

Economical — it uses the right model for the right task, tracks its own costs, and operates within defined budgets. It's a product with positive unit economics, not a research demo.

Durable — it handles work that spans hours and days, not just seconds. Scheduling negotiations, follow-up chains, and project coordination run reliably across sessions, surviving interruptions and failures gracefully.

The harness is what makes all six properties possible simultaneously. Without it, you can have an agent that's sometimes intelligent but unpredictably reckless. With it, you have an executive assistant that earns trust through consistent, bounded, observable behavior — and keeps earning it, interaction after interaction, week after week.

That's the Consul Harness.
