MDX Limo
Agent Cost Optimization Report — Pushable 3.0

Date: 2026-03-30
Scope: Per-iteration token cost reduction for agent runs (excludes model/provider selection)
Current state: ~82 tools (CEO agent), ~25 tools (normal agents), 25k+ token prompts


Executive Summary

The agent system already has several solid cost optimizations in place. However, there are five or six high-impact areas where significant savings are being left on the table — particularly tool-definition bloat, unbounded tool results, the missing trimMessages guard, and the Composio instruction mega-block. The biggest single win is likely dynamic tool selection, which could cut 40-60% of tool-definition tokens for the CEO agent.


Part 1: What We've Already Done Well

| Optimization | Impact | Status |
| --- | --- | --- |
| Prompt caching (Anthropic models) — stable/dynamic split | ~90% input savings on cached prefix | Done |
| Message summarization at 30 msgs, keep last 10 | Prevents unbounded context growth | Done |
| Graph caching with 10-min TTL per session | Avoids recompilation + DB re-fetches | Done |
| Procedural memory — single embedding call for all tool learnings | O(1) vs O(N tools) embedding lookups | Done |
| Tool usage summary — prevents repeating failed calls | Avoids wasted iterations | Done |
| Step budget (25 max tool iterations + warning at 22) | Prevents runaway loops | Done |
| Final answer node — forces synthesis when budget exhausted | Graceful degradation | Done |
| Lazy browser init — Chromium deferred until first use | No startup cost if browser unused | Done |
| Parallel tool/KB/MCP loading via Promise.allSettled | Load time = max, not sum | Done |
| Reflection runs post-stream (fire-and-forget) | Non-blocking learning | Done |
| Permission-based tool filtering — agents only get allowed tools | Reduces tool count for regular agents | Done |

These are genuinely good. The prompt caching split and procedural memory are ahead of what most agent platforms ship.

How prompt caching works today

The system prompt is split into two content blocks in gateway.ts:

  • Block 1 (Stable/Cached): Agent identity, operating principles, capability maps, tool instructions. Tagged with cache_control: { type: "ephemeral" }.
  • Block 2 (Dynamic): Memories, KB results, notebook entries, tool usage history, plan state, step budget warnings. Fresh per turn.

For Anthropic models via OpenRouter, a custom fetch wrapper (createCacheControlFetch()) re-injects cache_control after LangChain strips it. Non-Anthropic models get a single string with automatic caching.
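The body rewrite such a wrapper applies can be sketched as follows. This is illustrative only — the message shape and function name are assumptions, not the actual createCacheControlFetch() implementation:

```typescript
// Sketch: re-tag the first (stable) system content block so the prefix up to
// it is cached. Shapes mirror the Anthropic-style content-block format.
type ContentBlock = {
  type: string;
  text: string;
  cache_control?: { type: "ephemeral" };
};

type ChatBody = {
  messages: Array<{ role: string; content: string | ContentBlock[] }>;
};

function reinjectCacheControl(body: ChatBody): ChatBody {
  const system = body.messages.find((m) => m.role === "system");
  if (system && Array.isArray(system.content) && system.content.length > 0) {
    // Tag only the stable block; the dynamic block stays untagged so it
    // doesn't invalidate the cached prefix.
    system.content[0].cache_control = { type: "ephemeral" };
  }
  return body;
}
```

The real wrapper would apply this transform to the serialized request body inside a custom fetch before forwarding to OpenRouter.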

How summarization works today

  • Trigger: messages.length > 30 (SUMMARIZE_THRESHOLD)
  • Behavior: LLM generates a concise summary, deletes old messages via RemoveMessage, keeps last 10 messages
  • Safe split: Walks boundary back to avoid breaking tool call/response pairs
  • Summary persists across summarization cycles in state.summary
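The safe-split walk can be sketched roughly like this — a simplified message shape stands in for the LangChain message classes the real node operates on:

```typescript
// Sketch: pick the start index of the kept window so it never begins with a
// tool result whose parent AI tool-call message was summarized away.
interface Msg {
  type: "human" | "ai" | "tool";
}

function safeKeepBoundary(messages: Msg[], keepLast: number): number {
  let start = Math.max(0, messages.length - keepLast);
  // Walk the boundary back until the window no longer opens on an orphaned
  // tool result, keeping tool call/response pairs intact.
  while (start > 0 && messages[start].type === "tool") start--;
  return start;
}
```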

How the step budget works today

  • MAX_TOOL_ITERATIONS = 25 — tracked via state.step_count
  • At step >= 22: warning injected into dynamic prompt block
  • At step >= 25: routes to final_answer node (LLM invoked without tools, forced to synthesize)
  • Secondary safety net: recursionLimit: 50 in LangGraph config
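The routing rule reduces to a few lines. The thresholds match the values above; the function name and return labels are illustrative:

```typescript
// Sketch of the step-budget routing decision described above.
const MAX_TOOL_ITERATIONS = 25;
const WARN_AT = 22;

function routeForStep(
  stepCount: number,
  hasToolCalls: boolean
): "final_answer" | "tools" | "end" {
  // Budget exhausted: force synthesis without tools.
  if (stepCount >= MAX_TOOL_ITERATIONS) return "final_answer";
  if (hasToolCalls) return "tools";
  return "end";
}

// Warning injected into the dynamic prompt block when near the limit.
const shouldWarn = (stepCount: number) => stepCount >= WARN_AT;
```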

Part 2: What's Missing / Scope for Improvement

1. No Tool Result Truncation

Severity: Critical — High Impact

Problem: In agent.graph.ts:2222-2234, tool results are passed to the LLM in full, without any truncation:

```typescript
const result = await tool.invoke(tc.args);
const resultContent = typeof result === "string" ? result : JSON.stringify(result);
// Full resultContent goes into ToolMessage — NO truncation
results.push(new ToolMessage({ content: resultContent, ... }));
```

The .slice(0, 300) at line 2228 is only for logging, not for the actual message sent to the LLM.

Impact: A single Composio tool call returning a full Gmail inbox, a Google Sheet, or a GitHub issue list can easily be 5,000-50,000 tokens. This accumulates in conversation history and gets re-sent on every subsequent LLM call until summarization kicks in at 30 messages.

Fix: Add a configurable max result size. Truncate large tool results with a suffix like "... [truncated, showing first 3000 chars of 45000]". LangChain's docs recommend keeping tool results under ~4,000 tokens.

Estimated savings: 20-40% of total token cost for tool-heavy conversations.
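The fix above is roughly ten lines. The 3,000-char default and suffix format are illustrative choices, not values from the codebase:

```typescript
// Sketch: cap tool result size before it enters the ToolMessage.
const MAX_TOOL_RESULT_CHARS = 3000;

function truncateToolResult(
  result: string,
  maxChars: number = MAX_TOOL_RESULT_CHARS
): string {
  if (result.length <= maxChars) return result;
  return (
    result.slice(0, maxChars) +
    `... [truncated, showing first ${maxChars} chars of ${result.length}]`
  );
}

// Applied where the ToolMessage is built:
// results.push(new ToolMessage({ content: truncateToolResult(resultContent), ... }));
```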


2. No trimMessages Before LLM Invocation

Severity: Critical — High Impact

Problem: There is a custom summarization node that triggers at 30 messages, but the codebase is not using LangChain's built-in trimMessages utility. More importantly, between summarization events (messages 1-30), the full uncompressed history is sent on every single LLM call.

With 25 max tool iterations, each iteration can produce 2+ messages (AI + ToolMessage). That means by iteration 15, there could be 30+ messages — each containing full tool results — all being sent as input tokens on every call.

Fix: Use trimMessages (from @langchain/core/messages) with a token counter before llmWithTools.invoke(). Set a maxTokens budget (e.g., 8,000 tokens for history) and trim from the beginning, preserving tool call/response pairs. This provides per-call protection, complementing the per-session summarization.

```typescript
import { trimMessages } from "@langchain/core/messages";

const trimmed = await trimMessages(sanitizedMessages, {
  maxTokens: 8000,
  tokenCounter: llm.getNumTokens.bind(llm),
  strategy: "last",
  allowPartial: false,
});
await llmWithTools.invoke([systemMsg, ...trimmed], config);
```

Estimated savings: 15-30% of total input tokens across multi-step runs.


3. The Composio Instruction Block Is ~4,000 Tokens

Severity: High Impact

Problem: The Composio integration section in system-prompt-builder.ts:280-378 is ~100 lines / ~4,000 tokens of instruction text. It includes:

  • 6-step workflow instructions
  • 7 error handling patterns with specific regex-matched errors
  • 5 efficiency rules
  • 4 user-reference matching rules
  • Learning-from-experience instructions

This is sent on every single LLM call for any agent with even one Composio integration.

Fix options:

  1. Move most of this to a skill/reference doc that the agent can consult, instead of embedding in the system prompt
  2. Condense to ~500 tokens — the LLM already knows how to use tools. Keep only the 6-step flow summary and the critical rules (slug format, don't repeat searches). The verbose error handling table is unnecessary; the LLM can read error messages
  3. Make it dynamic — only include the full block on the first turn, then shrink to a short reminder on subsequent turns

Estimated savings: 3,000-3,500 tokens per call for agents with Composio.


4. Dynamic Tool Selection / Tool Pruning

Severity: High Impact (especially for CEO agent)

Problem: The CEO agent gets all system tools (28) + CEO tools (16) + base tools = ~57-82 tools. Every tool definition costs ~150-350 tokens (name + description + JSON schema). That's ~10,000-15,000 tokens just for tool definitions, sent on every call.

Many of these tools are irrelevant to the current turn. If the user says "what's the status of project X?", the agent doesn't need system_create_kb, system_delete_channel, system_create_schedule, etc.

LangChain explicitly recommends dynamic tool selection in their context engineering docs:

"Not every tool is appropriate for every turn. Too many tools cause selection errors; too few limit capabilities. Dynamic tool selection adapts the available toolset based on context."

Fix options:

  1. Two-phase approach: Use a lightweight/cheap model (nano/flash) to classify the user's intent and select only relevant tool groups (e.g., "project management" → CEO tools only, "create agent" → system tools only)
  2. Tool groups with a meta-tool: Replace 28 system tools with a system_management meta-tool that accepts an action parameter. Internally dispatches to the right sub-tool. Tool definition cost: 1 tool vs 28
  3. LangChain middleware: Use the dynamically selecting tools middleware pattern to prune tools per-turn based on conversation context

Estimated savings: 4,000-8,000 tokens per call for CEO agent.
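Fix option 1 boils down to a cheap classification step gating which tool groups get bound. The keyword regexes below are a stand-in for the nano/flash model call, and the group names are illustrative:

```typescript
// Sketch: select tool groups per turn instead of binding all ~82 tools.
type ToolGroup = "ceo" | "system" | "browser" | "base";

function selectToolGroups(intent: string): ToolGroup[] {
  const groups: ToolGroup[] = ["base"]; // base tools are always available
  if (/\b(project|status|okr|report)\b/i.test(intent)) groups.push("ceo");
  if (/\b(create|delete|agent|channel|schedule)\b/i.test(intent)) groups.push("system");
  if (/\b(browse|website|url|page)\b/i.test(intent)) groups.push("browser");
  return groups;
}
```

In a real implementation the classifier's output would drive which tool arrays are passed to bindTools() for that turn.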


5. System Prompt Is 46KB / ~12,000 Tokens Base

Severity: Medium Impact

Problem: system-prompt-builder.ts is 46KB (858 lines). Even for a simple agent with minimal capabilities, the base prompt (identity + core behavior + planning + confirmation rules) is ~2,500 tokens. For a CEO with all capabilities enabled, the stable block alone can exceed 10,000-12,000 tokens.

The prompt includes verbose sections that could be condensed:

| Section | Current Est. Tokens | Could Be |
| --- | --- | --- |
| Browser instructions (extension + cloud) | ~800 | ~200 |
| Planning/task tracking instructions | ~400 | ~100 |
| Confirmation rules | ~300 | ~80 |
| Composio block | ~4,000 | ~500 |
| Channel instructions | ~300 | ~80 |
| KB instructions | ~300 | ~100 |

Fix:

  • Apply the "minimal instruction principle" — modern LLMs don't need verbose step-by-step instructions for standard patterns
  • Use conditional blocks more aggressively — the sections are already gated by capability, but the behavioral instructions inside each one are verbose regardless of agent complexity
  • Target: reduce stable block from ~12K to ~5K tokens

Estimated savings: 5,000-7,000 tokens off the base prompt. Partially offset by caching for Anthropic models, but still affects cache write cost (25% premium) and all non-Anthropic models which don't benefit from the cache split.


6. Tool Descriptions Embedded in System Prompt AND Tool Definitions (Double Counting)

Severity: Medium Impact

Problem: In system-prompt-builder.ts:226-252, function tools are listed with their full descriptions and parameters inside the system prompt text:

```typescript
`- **${t.name}**: ${t.description || "No description"}
  Parameters: ${t.parameters}
  Returns: ${t.returnDescription || "result"}`
```

These same tools are also sent as formal tool definitions via bindTools(). The LLM sees every tool twice — once in the system message text, once in the tools array. This is pure duplication.

Fix: Remove the tool capability map from the system prompt. The LLM already sees tool names, descriptions, and schemas from the bound tools. If behavioral guidance is needed ("use tools before browser"), do it in a single sentence, not by re-listing every tool.

Estimated savings: 500-2,000 tokens depending on number of function tools.


7. Browser Tool Count (11 tools) — Consider Consolidation

Severity: Low-Medium Impact

Problem: The browser agent has 11 individual tools with some overlap:

  • click_element vs browser_click (index-based vs selector-based)
  • type_element vs browser_type (same operation, different paradigms)

Fix: Consolidate into fewer tools with optional parameters. E.g., one browser_interact tool with an action enum (click|type|scroll|select) and flexible targeting (index OR selector).

Estimated savings: ~600-800 tokens in tool definitions.
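The consolidated tool's argument shape and validation could look like this — the tool name, enum values, and validation messages are illustrative, not the real browser agent API:

```typescript
// Sketch: one browser_interact tool with an action enum and flexible targeting.
type BrowserAction = "click" | "type" | "scroll" | "select";

interface BrowserInteractArgs {
  action: BrowserAction;
  index?: number;    // index-based targeting (click_element style)
  selector?: string; // selector-based targeting (browser_click style)
  text?: string;     // required for "type"
}

// Returns an error message, or null when the args are valid.
function validateBrowserInteract(args: BrowserInteractArgs): string | null {
  if (args.index === undefined && args.selector === undefined)
    return "either index or selector is required";
  if (args.action === "type" && !args.text) return "text is required for type";
  return null;
}
```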


8. No Token-Aware Context Budgeting

Severity: Medium Impact

Problem: The current system has no awareness of how many tokens are actually being consumed. There's no tokenCounter anywhere in the message pipeline. The summarization trigger is purely message count (30), not token count. A conversation with 20 messages where each tool result is 5K tokens = 100K input tokens, but summarization won't trigger because count < 30.

Fix: Implement a token counter (use the model's getNumTokens() or tiktoken). Switch the summarization trigger to token-based: e.g., summarize when total message tokens exceed 15K. Use trimMessages as the immediate guard and summarization as the background compressor.

Estimated savings: Variable, but prevents worst-case scenarios where 20 messages with huge tool results blow up costs.
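A token-based trigger is a small change. The chars/4 estimate below is a rough stand-in for a real tokenizer, and the 15K budget mirrors the suggestion above:

```typescript
// Sketch: trigger summarization on estimated token count, not message count.
const SUMMARIZE_TOKEN_BUDGET = 15_000;

// Rough heuristic: ~4 chars per token. Swap in tiktoken or the model's
// getNumTokens() for accuracy.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function shouldSummarize(messageContents: string[]): boolean {
  const total = messageContents.reduce((sum, c) => sum + estimateTokens(c), 0);
  return total > SUMMARIZE_TOKEN_BUDGET;
}
```

With this guard, the 20-message / 100K-token scenario described above triggers summarization, while 30 short messages may not need it at all.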


Part 3: Quick-Win Priority Matrix

| # | Fix | Effort | Token Savings Per Call | Affects |
| --- | --- | --- | --- | --- |
| 1 | Truncate tool results to ~3-4K chars | Low (~10 lines of code) | 20-40% of total | All agents |
| 2 | Condense Composio block 4K → 500 tokens | Low (rewrite text) | ~3,000 tokens | Agents with integrations |
| 3 | Remove tool re-listing from system prompt | Low (delete one block) | 500-2K tokens | Agents with function tools |
| 4 | Add trimMessages before LLM invoke | Medium (~20 lines) | 15-30% of total | All agents |
| 5 | Dynamic tool selection for CEO | Medium-High | 4-8K tokens | CEO agent |
| 6 | Condense overall system prompt verbosity | Medium | 5-7K tokens | All agents |
| 7 | Token-based summarization trigger | Medium | Prevents worst-case blowups | All agents |
| 8 | Consolidate browser tools | Medium | ~700 tokens | Browser agents |

Recommended implementation order: 1 → 3 → 2 → 4 → 6 → 5 → 7 → 8


Part 4: Estimated Total Impact

CEO Agent (with Composio + Browser + System Tools)

| Component | Current Estimate | After Optimization |
| --- | --- | --- |
| System prompt (stable) | ~12,000 tokens | ~5,000 tokens |
| Tool definitions (~82 tools) | ~15,000 tokens | ~6,000 tokens |
| Dynamic block (memories, KB, etc.) | ~2,000 tokens | ~2,000 tokens |
| Message history (avg at iteration 10) | ~15,000 tokens | ~8,000 tokens |
| Total per-call input | ~44,000 tokens | ~21,000 tokens |

~52% reduction in per-call input tokens, compounding across 10-25 iterations per run.

Normal Agent (~25 tools, no Composio)

| Component | Current Estimate | After Optimization |
| --- | --- | --- |
| System prompt (stable) | ~5,000 tokens | ~3,000 tokens |
| Tool definitions (~25 tools) | ~5,000 tokens | ~4,000 tokens |
| Dynamic block | ~1,500 tokens | ~1,500 tokens |
| Message history | ~10,000 tokens | ~6,000 tokens |
| Total per-call input | ~21,500 tokens | ~14,500 tokens |

~33% reduction for normal agents.

Compounding Effect Across a Full Run

For a CEO agent run that uses 15 tool iterations:

  • Current: ~44K tokens × 15 calls = ~660K input tokens per run
  • Optimized: ~21K tokens × 15 calls = ~315K input tokens per run
  • Savings: ~345K input tokens per run (~52%)

With prompt caching (Anthropic), the cached portion savings are already applied. The numbers above reflect savings on the non-cached portions (tool definitions, message history, dynamic block) which are the majority of the cost.
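The per-run arithmetic above, spelled out (component figures are the report's estimates, not measurements):

```typescript
// Per-call input estimates for the CEO agent, before and after optimization.
const perCall = { current: 44_000, optimized: 21_000 };
const iterations = 15;

const currentRun = perCall.current * iterations;     // tokens per run today
const optimizedRun = perCall.optimized * iterations; // tokens per run after
const saved = currentRun - optimizedRun;
const savedPct = Math.round((saved / currentRun) * 100);
```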


Part 5: What We're NOT Doing Wrong

To be fair, some things that might seem like issues are not:

  • Prompt caching split — well-implemented, follows Anthropic best practice
  • Routing through OpenRouter — correct for the setup, doesn't add token overhead
  • Summarization approach — keeping 10 messages + summary is a reasonable strategy
  • Procedural memory — single-query semantic search is efficient
  • Fire-and-forget reflection — correct choice, no user-facing latency impact
  • Permission-based filtering — already reduces normal agent tool count
  • Graph caching — avoids expensive recompilation per message
  • Parallel I/O during graph creation — good use of Promise.allSettled

Part 6: Current Architecture Reference

Token Cost Breakdown by Component

```
LLM INPUT (per call)
│
├─ Tool Definitions
│     CEO:    ~15,000 tokens (82 tools × ~180 avg)
│     Normal:  ~5,000 tokens (25 tools × ~200 avg)
│
├─ System Prompt (Stable Block — CACHED)
│     Identity + Behavior:   ~2,500 tokens
│     Capability Maps:       ~2,000-8,000 tokens
│     Composio Instructions: ~4,000 tokens
│     Browser/Channel/Agent: ~1,000-2,000 tokens
│
├─ System Prompt (Dynamic Block — FRESH)
│     Memories:            ~100-500 tokens
│     Notebook:            ~100-300 tokens
│     Plan State:          ~100-200 tokens
│     Tool Usage History:  ~200-400 tokens
│     KB Results:          ~300-1,000 tokens
│     Procedural Memory:   ~200-500 tokens
│     Step Budget Warning: ~50 tokens (if near limit)
│
└─ Message History
      Human messages + AI responses + Tool results
      Grows UNBOUNDED until 30-message summarization
      Tool results: FULL SIZE (no truncation!)
      Avg at iteration 10: ~10,000-20,000 tokens
```

Graph Flow

```
__start__ → agent → [routing decision]
                      ├→ tools (if AI requests tools)
                      │    └→ agent (loop back)
                      ├→ summarize_conversation (if messages > 30)
                      │    └→ __end__
                      ├→ final_answer (if step_count >= 25)
                      │    └→ __end__
                      └→ __end__ (triggers post-stream reflection)
```

Tool Categories and Token Cost

| Category | Tool Count | Avg Tokens/Tool | Total Tokens | Loaded When |
| --- | --- | --- | --- | --- |
| Bucket | 6 | ~200 | ~1,200 | Always |
| Planning | 3 | ~150 | ~450 | Always |
| Python | 1 | ~400 | ~400 | Always |
| Workspace User | 1 | ~100 | ~100 | Always |
| Confirmation | 1 | ~100 | ~100 | Always |
| Memory | 1 | ~120 | ~120 | If userId present |
| Notebook | 4 | ~150 | ~600 | If userId present |
| Vault | 1 | ~120 | ~120 | If vault connected |
| Browser | 11 | ~200 | ~2,200 | If browser enabled |
| System | 28 | ~250 | ~7,000 | If systemLevelAccess |
| CEO | 16 | ~250 | ~4,000 | If isCeo |
| Tester | 7 | ~250 | ~1,750 | If isTester |
| Composio Meta | 5 | ~300 | ~1,500 | If integrations exist |
| Function Tools | Variable | ~200 | Variable | Per permissions |
| MCP Tools | Variable | ~200 | Variable | Per permissions |
| Agent Delegation | Variable | ~150 | Variable | Per permissions |
